UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations
Pith reviewed 2026-05-10 10:15 UTC · model grok-4.3
The pith
A distilled WavLM module produces clean phonetic representations to enable high-fidelity universal speech enhancement across sampling rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniPASE uses DeWavLM-Omni to convert degraded waveforms directly into clean phonetic representations. An Adapter produces enhanced acoustic representations from these, a neural Vocoder generates 16 kHz waveforms, and a PostNet upsamples to 48 kHz before final resampling to the original rate.
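A minimal sketch of this four-stage inference flow, assuming callable placeholder modules (dewavlm, adapter, vocoder, postnet are illustrative names, not the authors' API) and an input resample to WavLM's 16 kHz rate, which the abstract does not spell out:

```python
import torch
import torchaudio

def enhance(waveform: torch.Tensor, orig_sr: int,
            dewavlm, adapter, vocoder, postnet) -> torch.Tensor:
    # Assumed pre-step: bring the degraded input to the 16 kHz rate WavLM-style encoders expect.
    x16 = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16_000)
    phonetic = dewavlm(x16)       # stage 1: degraded waveform -> clean phonetic representations
    acoustic = adapter(phonetic)  # stage 2: phonetic -> enhanced acoustic representations
    y16 = vocoder(acoustic)       # stage 3: acoustic representations -> 16 kHz waveform
    y48 = postnet(y16)            # stage 4: bandwidth extension to 48 kHz
    # Final step: resample back to the caller's original rate.
    return torchaudio.functional.resample(y48, orig_freq=48_000, new_freq=orig_sr)
```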
What carries the argument
DeWavLM-Omni, a unified representation-level enhancement module fine-tuned from WavLM via knowledge distillation on large-scale multi-distortion data that maps degraded inputs to clean, linguistically faithful phonetic representations.
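The abstract does not specify the distillation objective or which WavLM layers are matched; the following is only a minimal sketch of representation-level distillation under those assumptions, with the student fed the degraded waveform and pulled toward the frozen teacher's encoding of the paired clean waveform:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, degraded_wav, clean_wav):
    # Frozen teacher (pretrained WavLM) encodes the clean reference signal.
    with torch.no_grad():
        target = teacher(clean_wav)
    # Student sees only the distorted input but must reproduce the clean representation.
    pred = student(degraded_wav)
    # L1 distance is an assumed choice; the paper's actual loss may differ.
    return F.l1_loss(pred, target)
```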
If this is right
- UniPASE achieves superior or competitive performance compared with existing state-of-the-art models on several evaluation datasets covering sub-tasks and full tasks.
- The model served as the backbone for the 1st-place submission in the URGENT 2026 Challenge objective evaluation.
- The pipeline handles inputs and outputs at multiple sampling rates without additional retraining.
- Enhancement maintains high acoustic fidelity while keeping linguistic hallucinations low.
Where Pith is reading between the lines
- Prioritizing phonetic accuracy before acoustic synthesis could transfer to restoring other time-series signals such as music or sensor data.
- The low-hallucination phonetic layer may improve accuracy when the enhanced output is fed into automatic speech recognition systems.
- Expanding the distillation training set to include rarer distortion combinations would likely further reduce errors on edge cases.
Load-bearing premise
Fine-tuning WavLM via knowledge distillation on a large-scale supervised multi-distortion dataset will reliably produce phonetic representations that remain clean and linguistically faithful with minimal hallucination across unseen distortions and sampling rates.
What would settle it
A test set of previously unseen distortion types or sampling rates where the output speech exhibits higher word error rates or semantic mismatches than strong baselines.
Original abstract
Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhancement module fine-tuned from WavLM via knowledge distillation on a large-scale supervised multi-distortion dataset. This module directly converts degraded waveforms into clean and linguistically faithful phonetic representations, ensuring robust enhancement with minimal linguistic hallucination. Based on these enhanced phonetic representations, an Adapter generates enhanced acoustic representations containing rich acoustic details, which a neural Vocoder uses to reconstruct corresponding high-fidelity 16-kHz waveforms. A PostNet then converts the waveforms to 48 kHz before resampling them to their original rates, enabling seamless handling of inputs and outputs at multiple sampling rates. Experimental results on several evaluation datasets, covering sub-tasks and full tasks, demonstrate that UniPASE achieves superior or competitive performance compared with existing state-of-the-art models. The proposed model also serves as the backbone of our submission to the URGENT 2026 Challenge, which achieved 1st place in the objective evaluation. The source code and audio demos are available at https://github.com/xiaobin-rong/unipase/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UniPASE, a generative model extending the PASE framework for universal speech enhancement (USE) that handles diverse distortions across multiple sampling rates. Its core component is DeWavLM-Omni, a WavLM model fine-tuned via knowledge distillation on a large-scale supervised multi-distortion dataset to map degraded waveforms to clean, linguistically faithful phonetic representations. These feed an Adapter for enhanced acoustic features, a neural Vocoder for 16 kHz waveform reconstruction, and a PostNet for 48 kHz upsampling followed by resampling to original rates. The manuscript claims superior or competitive performance versus state-of-the-art models on several evaluation datasets covering sub-tasks and full tasks, and states that UniPASE served as the backbone for the 1st-place entry in the URGENT 2026 Challenge objective track. Source code and audio demos are released.
Significance. If the central claims hold, UniPASE would advance universal speech enhancement by offering a unified pipeline that prioritizes low hallucination while supporting variable sampling rates and distortion types, with direct applicability to real-world audio restoration. The reported 1st-place result in the URGENT 2026 objective evaluation supplies external validation of practical utility. Explicit release of source code and demos is a positive contribution that supports reproducibility and community follow-up.
major comments (2)
- [DeWavLM-Omni and Experimental Results] The low-hallucination and universal-enhancement claims rest on DeWavLM-Omni producing faithful phonetic representations for inputs outside the training distribution, yet the manuscript supplies no OOD test splits, dedicated ablations isolating this module, or content-preservation metrics (e.g., phoneme error rate or ASR-WER on enhanced outputs) that would directly test generalization across unseen distortions and sampling rates.
- [Experimental Results] Performance claims are presented without accompanying quantitative tables, baseline comparisons, error bars, or dataset specifications in the abstract; the experimental section must furnish these details (including exact metrics on the URGENT 2026 test set) to substantiate the “superior or competitive” assertion and the 1st-place result.
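The first major comment asks for content-preservation metrics such as ASR-WER on enhanced outputs; one minimal way to compute such a check, assuming the third-party jiwer package and a placeholder recognizer (any fixed off-the-shelf ASR system would do):

```python
import jiwer  # third-party WER package (pip install jiwer)

def content_preservation_wer(transcribe, enhanced_wavs, reference_texts):
    # transcribe() stands in for a fixed ASR system, e.g. a Whisper wrapper.
    hypotheses = [transcribe(wav) for wav in enhanced_wavs]
    # Corpus-level word error rate between reference transcripts and ASR output
    # on the enhanced audio; lower values indicate fewer linguistic hallucinations.
    return jiwer.wer(reference_texts, hypotheses)
```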
minor comments (2)
- The abstract refers to “several evaluation datasets” without naming them or indicating which cover sub-tasks versus full tasks; an explicit list would improve traceability.
- The PostNet resampling step is described at a high level; adding a brief statement on how it avoids rate-conversion artifacts for arbitrary input rates would clarify the multi-rate handling.
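On the second minor comment, one conventional way to keep rate-conversion artifacts low when going from the fixed 48 kHz PostNet output to an arbitrary original rate is band-limited sinc resampling; a sketch using torchaudio, with filter settings chosen for illustration rather than taken from the paper:

```python
import torchaudio

def to_original_rate(y48, target_sr: int):
    # Windowed-sinc resampling from 48 kHz to the requested rate; a wider
    # low-pass filter reduces aliasing at the cost of extra computation.
    return torchaudio.functional.resample(
        y48, orig_freq=48_000, new_freq=target_sr,
        lowpass_filter_width=64, rolloff=0.94,
    )
```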
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: The low-hallucination and universal-enhancement claims rest on DeWavLM-Omni producing faithful phonetic representations for inputs outside the training distribution, yet the manuscript supplies no OOD test splits, dedicated ablations isolating this module, or content-preservation metrics (e.g., phoneme error rate or ASR-WER on enhanced outputs) that would directly test generalization across unseen distortions and sampling rates.
  Authors: We agree that explicit demonstration of out-of-distribution generalization is important for validating the low-hallucination claims. The training of DeWavLM-Omni uses a large-scale multi-distortion dataset that encompasses a broad variety of distortions and sampling rates, and the evaluation includes the URGENT 2026 Challenge test set, which features unseen conditions. However, we acknowledge the lack of dedicated OOD splits and ablations in the current manuscript. In the revision, we will include additional ablations isolating the DeWavLM-Omni module and report content-preservation metrics such as ASR-WER on the enhanced outputs to directly assess linguistic fidelity. We will also clarify the coverage of the training distribution.
  revision: partial
- Referee: Performance claims are presented without accompanying quantitative tables, baseline comparisons, error bars, or dataset specifications in the abstract; the experimental section must furnish these details (including exact metrics on the URGENT 2026 test set) to substantiate the “superior or competitive” assertion and the 1st-place result.
  Authors: The abstract provides a high-level summary of the results, as is conventional, while the experimental section of the manuscript includes detailed quantitative tables comparing UniPASE against state-of-the-art baselines on multiple datasets, along with dataset specifications. To address the concern, we will ensure that error bars are included where applicable (e.g., for multiple runs) and explicitly report the exact objective metrics achieved on the URGENT 2026 test set in the revised experimental section. This will substantiate the performance claims and the 1st-place result more clearly.
  revision: yes
Circularity Check
No significant circularity; the claims rest on experimental validation of the proposed architecture.
full rationale
The provided manuscript text describes UniPASE as an extension of a prior PASE framework, with core module DeWavLM-Omni obtained by fine-tuning WavLM via knowledge distillation on a supervised multi-distortion dataset. Subsequent stages (Adapter, Vocoder, PostNet) are described as sequential processing steps to produce enhanced waveforms at multiple sampling rates. All performance claims (superior/competitive results on evaluation datasets, 1st place in URGENT 2026 objective track) are presented as outcomes of experiments rather than quantities derived from equations or fitted parameters. No equations, derivations, or self-referential definitions appear in the abstract or described text. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the architecture or results. The chain is therefore self-contained as an empirical proposal with external validation via challenge results and dataset evaluations.
Axiom & Free-Parameter Ledger
invented entities (3)
- DeWavLM-Omni: no independent evidence
- Adapter: no independent evidence
- PostNet: no independent evidence