Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition
Pith reviewed 2026-05-18 18:03 UTC · model grok-4.3
The pith
Frame-wise expert routing on self-supervised features lets one model improve both speech enhancement and emotion recognition in noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sparse MERIT uses task-specific gating networks to perform frame-wise dynamic selection from a shared expert pool applied to self-supervised speech representations, enabling joint optimization of speech enhancement and speech emotion recognition that reduces gradient interference and representational conflicts.
What carries the argument
Sparse MERIT: mixture-of-experts architecture with frame-wise routing through task-specific gating networks over a shared expert pool for adaptive representation learning.
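The routing idea described above can be illustrated with a small NumPy sketch. This is not the paper's implementation: the top-k selection rule, the single-linear-layer experts, the function name `sparse_route`, and all sizes (T, D, E, k) are assumptions chosen for illustration. It only shows how two task-specific gates can pick different experts from one shared pool, frame by frame.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf come out as exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_route(frames, gate_w, expert_ws, k=2):
    """Mix the top-k experts per frame using ONE task's gate (hypothetical helper).

    frames:    (T, D) frame-level features
    gate_w:    (D, E) gating weights for a single task
    expert_ws: (E, D, D) shared expert weights (one linear map per expert)
    Returns the mixed output (T, D) and the gate weights (T, E).
    """
    logits = frames @ gate_w                              # (T, E)
    top_idx = np.argsort(logits, axis=-1)[:, -k:]         # k largest logits per frame
    masked = np.full_like(logits, -np.inf)
    kept = np.take_along_axis(logits, top_idx, axis=-1)
    np.put_along_axis(masked, top_idx, kept, axis=-1)     # keep only the top-k logits
    weights = softmax(masked, axis=-1)                    # zero outside the top-k
    # All experts are evaluated here for clarity; a real sparse layer
    # would run only the selected experts.
    expert_out = np.einsum("td,edh->teh", frames, expert_ws)  # (T, E, D)
    out = np.einsum("te,teh->th", weights, expert_out)        # (T, D)
    return out, weights

rng = np.random.default_rng(0)
T, D, E = 5, 8, 4                                  # illustrative sizes only
frames = rng.standard_normal((T, D))
experts = rng.standard_normal((E, D, D)) * 0.1     # shared expert pool
gate_se = rng.standard_normal((D, E))              # gate for the SE task
gate_ser = rng.standard_normal((D, E))             # gate for the SER task

se_out, se_w = sparse_route(frames, gate_se, experts, k=2)
ser_out, ser_w = sparse_route(frames, gate_ser, experts, k=2)
```

Each task gets its own gate weights but reads from the same `experts` array, which is the parameter-sharing pattern the summary attributes to the method.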
If this is right
- At -5 dB SNR the method raises SER F1-macro by an average of 12.0 percent over an SE pre-processing baseline and 3.4 percent over naive multi-task learning, with statistically significant gains on unseen noise conditions.
- Segmental SNR for the enhancement task rises 28.2 percent over the pre-processing baseline and 20.0 percent over the naive multi-task baseline.
- Both tasks improve at the same time rather than trading off performance.
- The routing remains effective when the test noises differ from those seen during training.
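For readers unfamiliar with the enhancement metric cited in the list above, segmental SNR is commonly computed as the mean of per-frame SNRs in dB, with each frame clipped to a fixed range. The sketch below assumes non-overlapping frames, a frame length of 256 samples, and the widely used [-10, 35] dB clip; the paper's exact settings are not given on this page, so these values are assumptions.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0, eps=1e-10):
    """Mean of per-frame SNRs (dB) between a clean reference and an estimate.

    frame_len, lo, hi, and eps are illustrative defaults, not the paper's values.
    """
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        err = s - e                                   # residual distortion in this frame
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(err ** 2) + eps))
        snrs.append(np.clip(snr, lo, hi))             # clip each frame before averaging
    return float(np.mean(snrs))

t = np.arange(1024) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)                 # synthetic stand-in for speech
ssnr_perfect = segmental_snr(clean, clean)            # estimate equals the reference
ssnr_silent = segmental_snr(clean, np.zeros_like(clean))  # estimate is all zeros
```

A perfect estimate saturates at the upper clip, while an all-zero estimate yields 0 dB per frame because the error energy equals the signal energy.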
Where Pith is reading between the lines
- The same routing idea could be tested on other paired audio tasks such as dereverberation paired with speaker verification.
- Increasing the number of experts while keeping the gating sparse might further reduce conflicts if extra compute is available.
- Replacing the current self-supervised front-end with a different pretrained encoder would test whether the routing benefit depends on the specific input features.
Load-bearing premise
Dynamic frame-level routing through a shared expert pool can separate the needs of enhancement and emotion recognition without dropping task-critical details from the input features.
What would settle it
If a non-routed shared-backbone model trained on the same data and self-supervised features matches or exceeds Sparse MERIT on both F1-macro and segmental SNR at -5 dB SNR on unseen noise, the benefit of the mixture routing would be called into question.
Original abstract
Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. Although speech enhancement (SE) can improve robustness, it often introduces artifacts that obscure emotional cues and adds computational overhead to the pipeline. Multi-task learning (MTL) offers an alternative by jointly optimizing SE and SER tasks. However, conventional shared-backbone models frequently suffer from gradient interference and representational conflicts between tasks. To address these challenges, we propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Sparse MERIT incorporates task-specific gating networks that dynamically select from a shared pool of experts for each frame, enabling parameter-efficient and task-adaptive representation learning. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks. Under the most challenging condition of -5 dB signal-to-noise ratio (SNR), Sparse MERIT improves SER F1-macro by an average of 12.0% over a baseline relying on a SE pre-processing strategy, and by 3.4% over a naive MTL baseline, with statistical significance on unseen noise conditions. For SE, Sparse MERIT improves segmental SNR (SSNR) by 28.2% over the SE pre-processing baseline and by 20.0% over the naive MTL baseline. These results demonstrate that Sparse MERIT provides robust and generalizable performance for both emotion recognition and enhancement tasks in noisy environments.
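The "-5 dB SNR" condition in the abstract means the noise carries more power than the speech. A minimal sketch of how a noisy mixture at a target SNR is typically constructed follows; the function name `mix_at_snr` and the random stand-in signals are assumptions for illustration, not the paper's data pipeline.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10*log10(p_speech / (scale**2 * p_noise)) = snr_db for scale.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, scale

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # random stand-in for a speech waveform
noise = rng.standard_normal(16000)    # random stand-in for a noise recording
noisy, scale = mix_at_snr(speech, noise, snr_db=-5.0)

# Check the SNR actually achieved by the scaled noise.
achieved_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((scale * noise) ** 2))
```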
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Sparse MERIT, a multi-task learning framework that uses a sparse mixture-of-experts architecture with frame-wise dynamic routing over self-supervised speech representations to jointly perform speech enhancement (SE) and speech emotion recognition (SER). Task-specific gating networks select experts from a shared pool to mitigate gradient interference and representational conflicts between the tasks. Experiments on the MSP-Podcast corpus under noisy conditions, including unseen noise at low SNRs, report consistent gains over SE pre-processing and naive MTL baselines, with specific improvements at -5 dB SNR of 12.0% in SER F1-macro and 28.2% in segmental SNR (SSNR).
Significance. If the reported gains are reproducible, the work offers a practical advance for robust emotion recognition in noisy environments by avoiding artifacts from separate enhancement stages and reducing task conflicts in joint training. The parameter-efficient MoE routing on self-supervised features is a timely contribution to multi-task speech processing, and the evaluation on public data with statistical significance claims on held-out conditions strengthens the case for generalizability.
major comments (2)
- [Abstract and §4] The performance claims (e.g., 12.0% SER F1-macro gain and 28.2% SSNR gain at -5 dB SNR) are load-bearing for the central contribution, yet the manuscript provides no details on exact baseline implementations, hyperparameter search procedures, or training protocols for the SE pre-processing and naive MTL baselines; this limits independent verification of the improvements.
- [§3.2] The frame-wise routing mechanism is described as resolving representational conflicts without losing task-critical information, but the paper does not include an ablation isolating the effect of sparsity or dynamic selection versus a dense shared backbone; without this, it is unclear whether the gains stem from the proposed routing or from other factors such as increased capacity.
minor comments (2)
- [§3] Notation for the gating network and expert selection should be clarified with explicit equations to distinguish the task-specific gates from the shared expert pool.
- [§5] Figure captions and axis labels in the results section could be expanded to indicate the exact noise types and SNR levels used in each panel for easier interpretation.
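As an illustration of the kind of explicit notation the first minor comment asks for, one possible formulation is sketched below. It follows the standard sparsely-gated mixture-of-experts form and is an assumption, not the paper's own equations. The symbols $h_t$ (self-supervised feature at frame $t$), $W_g^{\tau}$ (gate for task $\tau$), $f_e$ (shared expert $e$), $E$ (pool size), and $k$ (experts kept per frame) are all introduced here for illustration.

```latex
% Task-specific gate over a shared expert pool (illustrative notation only)
\begin{align}
  g_t^{\tau} &= \operatorname{softmax}\!\bigl(\operatorname{TopK}(W_g^{\tau} h_t,\; k)\bigr),
    \qquad \tau \in \{\mathrm{SE}, \mathrm{SER}\}, \\
  z_t^{\tau} &= \sum_{e=1}^{E} g_{t,e}^{\tau}\, f_e(h_t),
\end{align}
% TopK keeps the k largest logits and sets the others to -\infty,
% so g_{t,e}^{\tau} = 0 for every expert that is not selected at frame t.
```

In this form the gate parameters $W_g^{\tau}$ carry the task index while the experts $f_e$ do not, which makes the split between task-specific and shared components explicit.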
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and indicate the revisions we will make to improve the manuscript.
Point-by-point responses
- Referee: [Abstract and §4] The performance claims (e.g., 12.0% SER F1-macro gain and 28.2% SSNR gain at -5 dB SNR) are load-bearing for the central contribution, yet the manuscript provides no details on exact baseline implementations, hyperparameter search procedures, or training protocols for the SE pre-processing and naive MTL baselines; this limits independent verification of the improvements.
  Authors: We agree that greater detail on the baselines is required for reproducibility. In the revised manuscript we will expand §4 to specify the exact architectures and training configurations of the SE pre-processing models, the naive MTL baseline, the hyperparameter search procedure (including ranges and selection method), and the full training protocols (optimizer, learning-rate schedule, batch size, and epochs). These additions will appear in the main text or supplementary material as space allows.
  Revision: yes
- Referee: [§3.2] The frame-wise routing mechanism is described as resolving representational conflicts without losing task-critical information, but the paper does not include an ablation isolating the effect of sparsity or dynamic selection versus a dense shared backbone; without this, it is unclear whether the gains stem from the proposed routing or from other factors such as increased capacity.
  Authors: The referee correctly notes the absence of a direct ablation against a capacity-matched dense backbone. We will add this comparison in the revised version, reporting results for a dense shared-backbone model with parameter count comparable to Sparse MERIT. The new ablation will quantify the contribution of frame-wise sparsity and task-specific gating to the observed gains and will be discussed in §4.
  Revision: yes
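To make the "capacity-matched dense backbone" in the second response concrete, the sketch below shows one way to size a dense feed-forward block so its total parameter count is close to that of a mixture-of-experts block. The two-layer feed-forward expert shape, the function names, and the sizes (768-dimensional features, 8 experts, hidden width 256, 2 tasks) are all hypothetical and are not taken from the paper. It matches total parameters; matching the parameters active per frame would be a different and equally defensible choice.

```python
def ffn_params(d_model, d_hidden):
    # Two linear layers with biases: d_model -> d_hidden -> d_model.
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model

def moe_params(d_model, d_hidden, n_experts, n_tasks):
    # Shared expert pool plus one linear gate (with bias) per task.
    gate = d_model * n_experts + n_experts
    return n_experts * ffn_params(d_model, d_hidden) + n_tasks * gate

def matched_dense_hidden(d_model, target_params):
    # Invert ffn_params: params = d_hidden * (2 * d_model + 1) + d_model.
    return max(1, round((target_params - d_model) / (2 * d_model + 1)))

# Hypothetical configuration, for illustration only.
d_model, d_hidden, n_experts, n_tasks = 768, 256, 8, 2
target = moe_params(d_model, d_hidden, n_experts, n_tasks)
dense_hidden = matched_dense_hidden(d_model, target)
dense_total = ffn_params(d_model, dense_hidden)
rel_gap = abs(dense_total - target) / target   # relative mismatch after rounding
```

Because the hidden width must be an integer, the match is approximate; the remaining gap is at most about one hidden unit's worth of parameters.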
Circularity Check
No significant circularity
The paper introduces Sparse MERIT as an empirical MTL architecture with frame-wise MoE routing over self-supervised features for joint SE and SER. All reported results consist of measured performance deltas (F1-macro, SSNR) on the external MSP-Podcast corpus under controlled noisy conditions, including unseen noise, with explicit baseline comparisons and statistical significance. No equations, uniqueness theorems, or first-principles derivations appear in the provided text; the method is presented as a proposed framework whose value is established by external validation rather than by reducing any quantity to a fitted parameter or self-citation by construction. The reader's assessment of score 2.0 is consistent with this self-contained empirical structure.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Self-supervised speech representations contain sufficient information for both enhancement and emotion recognition tasks.
invented entities (1)
- Sparse MERIT framework (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Sparse MERIT incorporates task-specific gating networks that dynamically select from a shared pool of experts for each frame"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "frame-wise expert routing over self-supervised speech representations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.