UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations
Pith reviewed 2026-05-10 10:15 UTC · model grok-4.3
The pith
A distilled WavLM module produces clean phonetic representations to enable high-fidelity universal speech enhancement across sampling rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniPASE uses DeWavLM-Omni to convert degraded waveforms directly into clean phonetic representations. An Adapter produces enhanced acoustic representations from these, a neural Vocoder generates 16 kHz waveforms, and a PostNet upsamples to 48 kHz before final resampling to the original rate.
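A minimal sketch of this four-stage inference flow, assuming callable placeholder modules (dewavlm, adapter, vocoder, postnet are illustrative names, not the authors' API) and an input resample to WavLM's 16 kHz rate, which the abstract does not spell out:

```python
import torch
import torchaudio

def enhance(waveform: torch.Tensor, orig_sr: int,
            dewavlm, adapter, vocoder, postnet) -> torch.Tensor:
    # Assumed pre-step: bring the degraded input to the 16 kHz rate WavLM-style encoders expect.
    x16 = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16_000)
    phonetic = dewavlm(x16)       # stage 1: degraded waveform -> clean phonetic representations
    acoustic = adapter(phonetic)  # stage 2: phonetic -> enhanced acoustic representations
    y16 = vocoder(acoustic)       # stage 3: acoustic representations -> 16 kHz waveform
    y48 = postnet(y16)            # stage 4: bandwidth extension to 48 kHz
    # Final step: resample back to the caller's original rate.
    return torchaudio.functional.resample(y48, orig_freq=48_000, new_freq=orig_sr)
```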
What carries the argument
DeWavLM-Omni, a unified representation-level enhancement module fine-tuned from WavLM via knowledge distillation on large-scale multi-distortion data that maps degraded inputs to clean, linguistically faithful phonetic representations.
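The abstract does not specify the distillation objective or which WavLM layers are matched; the following is only a minimal sketch of representation-level distillation under those assumptions, with the student fed the degraded waveform and pulled toward the frozen teacher's encoding of the paired clean waveform:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, degraded_wav, clean_wav):
    # Frozen teacher (pretrained WavLM) encodes the clean reference signal.
    with torch.no_grad():
        target = teacher(clean_wav)
    # Student sees only the distorted input but must reproduce the clean representation.
    pred = student(degraded_wav)
    # L1 distance is an assumed choice; the paper's actual loss may differ.
    return F.l1_loss(pred, target)
```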
If this is right
- UniPASE achieves superior or competitive performance compared with existing state-of-the-art models on several evaluation datasets covering sub-tasks and full tasks.
- The model served as the backbone for the 1st-place submission in the URGENT 2026 Challenge objective evaluation.
- The pipeline handles inputs and outputs at multiple sampling rates without additional retraining.
- Enhancement maintains high acoustic fidelity while keeping linguistic hallucinations low.
Where Pith is reading between the lines
- Prioritizing phonetic accuracy before acoustic synthesis could transfer to restoring other time-series signals such as music or sensor data.
- The low-hallucination phonetic layer may improve accuracy when the enhanced output is fed into automatic speech recognition systems.
- Expanding the distillation training set to include rarer distortion combinations would likely further reduce errors on edge cases.
Load-bearing premise
Fine-tuning WavLM via knowledge distillation on a large-scale supervised multi-distortion dataset will reliably produce phonetic representations that remain clean and linguistically faithful with minimal hallucination across unseen distortions and sampling rates.
What would settle it
A test set of previously unseen distortion types or sampling rates where the output speech exhibits higher word error rates or semantic mismatches than strong baselines.
Original abstract
Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhancement module fine-tuned from WavLM via knowledge distillation on a large-scale supervised multi-distortion dataset. This module directly converts degraded waveforms into clean and linguistically faithful phonetic representations, ensuring robust enhancement with minimal linguistic hallucination. Based on these enhanced phonetic representations, an Adapter generates enhanced acoustic representations containing rich acoustic details, which a neural Vocoder uses to reconstruct corresponding high-fidelity 16-kHz waveforms. A PostNet then converts the waveforms to 48 kHz before resampling them to their original rates, enabling seamless handling of inputs and outputs at multiple sampling rates. Experimental results on several evaluation datasets, covering sub-tasks and full tasks, demonstrate that UniPASE achieves superior or competitive performance compared with existing state-of-the-art models. The proposed model also serves as the backbone of our submission to the URGENT 2026 Challenge, which achieved 1st place in the objective evaluation. The source code and audio demos are available at https://github.com/xiaobin-rong/unipase/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UniPASE, a generative model extending the PASE framework for universal speech enhancement (USE) that handles diverse distortions across multiple sampling rates. Its core component is DeWavLM-Omni, a WavLM model fine-tuned via knowledge distillation on a large-scale supervised multi-distortion dataset to map degraded waveforms to clean, linguistically faithful phonetic representations. These feed an Adapter for enhanced acoustic features, a neural Vocoder for 16 kHz waveform reconstruction, and a PostNet for 48 kHz upsampling followed by resampling to original rates. The manuscript claims superior or competitive performance versus state-of-the-art models on several evaluation datasets covering sub-tasks and full tasks, and states that UniPASE served as the backbone for the 1st-place entry in the URGENT 2026 Challenge objective track. Source code and audio demos are released.
Significance. If the central claims hold, UniPASE would advance universal speech enhancement by offering a unified pipeline that prioritizes low hallucination while supporting variable sampling rates and distortion types, with direct applicability to real-world audio restoration. The reported 1st-place result in the URGENT 2026 objective evaluation supplies external validation of practical utility. Explicit release of source code and demos is a positive contribution that supports reproducibility and community follow-up.
major comments (2)
- [DeWavLM-Omni and Experimental Results] The low-hallucination and universal-enhancement claims rest on DeWavLM-Omni producing faithful phonetic representations for inputs outside the training distribution, yet the manuscript supplies no OOD test splits, dedicated ablations isolating this module, or content-preservation metrics (e.g., phoneme error rate or ASR-WER on enhanced outputs) that would directly test generalization across unseen distortions and sampling rates.
- [Experimental Results] Performance claims are presented without accompanying quantitative tables, baseline comparisons, error bars, or dataset specifications in the abstract; the experimental section must furnish these details (including exact metrics on the URGENT 2026 test set) to substantiate the “superior or competitive” assertion and the 1st-place result.
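The first major comment asks for content-preservation metrics such as ASR-WER on enhanced outputs; one minimal way to compute such a check, assuming the third-party jiwer package and a placeholder recognizer (any fixed off-the-shelf ASR system would do):

```python
import jiwer  # third-party WER package (pip install jiwer)

def content_preservation_wer(transcribe, enhanced_wavs, reference_texts):
    # transcribe() stands in for a fixed ASR system, e.g. a Whisper wrapper.
    hypotheses = [transcribe(wav) for wav in enhanced_wavs]
    # Corpus-level word error rate between reference transcripts and ASR output
    # on the enhanced audio; lower values indicate fewer linguistic hallucinations.
    return jiwer.wer(reference_texts, hypotheses)
```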
minor comments (2)
- The abstract refers to “several evaluation datasets” without naming them or indicating which cover sub-tasks versus full tasks; an explicit list would improve traceability.
- The PostNet resampling step is described at a high level; adding a brief statement on how it avoids rate-conversion artifacts for arbitrary input rates would clarify the multi-rate handling.
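On the second minor comment, one conventional way to keep rate-conversion artifacts low when going from the fixed 48 kHz PostNet output to an arbitrary original rate is band-limited sinc resampling; a sketch using torchaudio, with filter settings chosen for illustration rather than taken from the paper:

```python
import torchaudio

def to_original_rate(y48, target_sr: int):
    # Windowed-sinc resampling from 48 kHz to the requested rate; a wider
    # low-pass filter reduces aliasing at the cost of extra computation.
    return torchaudio.functional.resample(
        y48, orig_freq=48_000, new_freq=target_sr,
        lowpass_filter_width=64, rolloff=0.94,
    )
```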
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: The low-hallucination and universal-enhancement claims rest on DeWavLM-Omni producing faithful phonetic representations for inputs outside the training distribution, yet the manuscript supplies no OOD test splits, dedicated ablations isolating this module, or content-preservation metrics (e.g., phoneme error rate or ASR-WER on enhanced outputs) that would directly test generalization across unseen distortions and sampling rates.
  Authors: We agree that explicit demonstration of out-of-distribution generalization is important for validating the low-hallucination claims. The training of DeWavLM-Omni uses a large-scale multi-distortion dataset that encompasses a broad variety of distortions and sampling rates, and the evaluation includes the URGENT 2026 Challenge test set, which features unseen conditions. However, we acknowledge the lack of dedicated OOD splits and ablations in the current manuscript. In the revision, we will include additional ablations isolating the DeWavLM-Omni module and report content-preservation metrics such as ASR-WER on the enhanced outputs to directly assess linguistic fidelity. We will also clarify the coverage of the training distribution.
  revision: partial
- Referee: Performance claims are presented without accompanying quantitative tables, baseline comparisons, error bars, or dataset specifications in the abstract; the experimental section must furnish these details (including exact metrics on the URGENT 2026 test set) to substantiate the “superior or competitive” assertion and the 1st-place result.
  Authors: The abstract provides a high-level summary of the results, as is conventional, while the experimental section of the manuscript includes detailed quantitative tables comparing UniPASE against state-of-the-art baselines on multiple datasets, along with dataset specifications. To address the concern, we will ensure that error bars are included where applicable (e.g., for multiple runs) and explicitly report the exact objective metrics achieved on the URGENT 2026 test set in the revised experimental section. This will substantiate the performance claims and the 1st-place result more clearly.
  revision: yes
Circularity Check
No significant circularity; the claims rest on experimental validation of the proposed architecture.
full rationale
The provided manuscript text describes UniPASE as an extension of a prior PASE framework, with core module DeWavLM-Omni obtained by fine-tuning WavLM via knowledge distillation on a supervised multi-distortion dataset. Subsequent stages (Adapter, Vocoder, PostNet) are described as sequential processing steps to produce enhanced waveforms at multiple sampling rates. All performance claims (superior/competitive results on evaluation datasets, 1st place in URGENT 2026 objective track) are presented as outcomes of experiments rather than quantities derived from equations or fitted parameters. No equations, derivations, or self-referential definitions appear in the abstract or described text. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the architecture or results. The chain is therefore self-contained as an empirical proposal with external validation via challenge results and dataset evaluations.
Axiom & Free-Parameter Ledger
invented entities (3)
- DeWavLM-Omni: no independent evidence
- Adapter: no independent evidence
- PostNet: no independent evidence