Robust One-step Speech Enhancement via Consistency Distillation

Liang Xu; Longfei Felix Yan; W. Bastiaan Kleijn

arxiv: 2507.05688 · v2 · pith:ECFNWJVSnew · submitted 2025-07-08 · 📡 eess.AS · cs.SD

Robust One-step Speech Enhancement via Consistency Distillation

Liang Xu , Longfei Felix Yan , W. Bastiaan Kleijn This is my paper

Pith reviewed 2026-05-22 00:05 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords speech enhancementconsistency distillationdiffusion modelsone-step samplingrobustnessreal-time inference

0 comments

The pith

A randomized trajectory and auxiliary losses let a one-step consistency model outperform its multi-step diffusion teacher in speech enhancement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that consistency distillation can be made robust enough for one-step speech enhancement by randomizing the learning trajectory and adding joint time-domain losses. This counters the built-in bias toward the teacher's sampling path and lets the student recover from the teacher's errors. The result is a model that runs 54 times faster than the 30-step teacher while delivering higher speech quality on the VoiceBank-DEMAND benchmark. The same model also shows stronger generalization on unseen datasets and real recordings. If correct, this removes the main barrier that has kept diffusion-based enhancers out of real-time use.

Core claim

Distilling a one-step consistency model from a diffusion teacher becomes robust when the student is trained on a randomized trajectory and jointly optimized with two time-domain auxiliary losses, allowing it to exceed the teacher's performance at 1/54th the inference cost.

What carries the argument

The combination of a randomized learning trajectory during distillation and joint optimization against two time-domain auxiliary losses, which reduces trajectory bias and corrects inherited errors.

If this is right

Real-time speech enhancement becomes practical on edge devices without sacrificing quality.
The distilled model can replace multi-step diffusion teachers in latency-sensitive applications.
Performance gains hold across both simulated and real-world noisy recordings.
Generalization improves on out-of-domain noise conditions compared with the original teacher.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same randomization-plus-auxiliary-loss pattern could be tested on other one-step distillation tasks such as audio source separation.
If the error-correction effect scales, the approach may reduce the need for very large teacher models in audio generation pipelines.

Load-bearing premise

Randomizing the learning trajectory and adding the two time-domain losses will overcome the teacher's sampling bias and inherited errors without creating new artifacts or hurting performance on clean speech.

What would settle it

An experiment that feeds the one-step model perfectly clean speech and measures whether its output quality drops below the teacher's or introduces measurable artifacts.

read the original abstract

Diffusion models have shown strong performance in speech enhancement, but their real-time applicability has been limited by multi-step iterative sampling. Consistency distillation has recently emerged as a promising alternative by distilling a one-step consistency model from a multi-step diffusion-based teacher model. However, distilled consistency models are inherently biased towards the sampling trajectory of the teacher model, making them less robust to noise and prone to inheriting inaccuracies from the teacher model. To address this limitation, we propose ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation, a novel approach for distilling a one-step consistency model. Specifically, we introduce a randomized learning trajectory to improve the model's robustness to noise. Furthermore, we jointly optimize the one-step model with two time-domain auxiliary losses, enabling it to recover from teacher-induced errors and surpass the teacher model in overall performance. This is the first pure one-step consistency distillation model for diffusion-based speech enhancement, achieving 54 times faster inference speed and superior performance compared to its 30-step teacher model. Experiments on the VoiceBank-DEMAND dataset demonstrate that the proposed model achieves state-of-the-art performance in terms of speech quality. Moreover, its generalization ability is validated on both an out-of-domain dataset and real-world noisy recordings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gets a one-step consistency model to beat its 30-step teacher on speech enhancement metrics, but the auxiliary losses' role in fixing trajectory bias lacks the clean-input controls needed to rule out simpler explanations.

read the letter

The core result is that ROSE-CD distills a one-step model from a diffusion teacher and reports better scores than the teacher itself on VoiceBank-DEMAND while running 54 times faster. The technical additions are a randomized sampling trajectory during training and joint optimization with two time-domain auxiliary losses, which the authors say reduce inherited trajectory bias and let the student recover from teacher errors. This combination is presented as new relative to earlier consistency-distillation work in the area, and the experiments include out-of-domain and real-recording tests that show reasonable generalization. Those pieces are the concrete contributions worth noting. The method is a direct engineering response to a known limitation of distilled consistency models, and the reported speed-quality trade-off is the part that would interest people building real-time audio tools. The results on the main dataset look competitive with current numbers in the field. The soft spot is the causal claim for the auxiliary losses. The paper does not appear to include an ablation that isolates them against a pure consistency-distillation baseline on clean or near-clean inputs, nor does it report any metric degradation on clean VoiceBank-DEMAND utterances. Without that check it remains possible that the auxiliaries simply act as a stronger denoiser rather than specifically correcting trajectory bias. That gap does not invalidate the overall numbers, but it leaves the explanation for why the student beats the teacher less secure than the abstract suggests. Readers working on low-latency speech enhancement or on consistency models for audio would get practical value from the implementation choices and the reported speed-up. The work is coherent on its own terms and engages the relevant prior distillation literature, so it is worth sending to referees even if the auxiliary-loss story needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces ROSE-CD, a one-step consistency distillation framework for diffusion-based speech enhancement. It proposes randomized trajectory sampling during distillation combined with joint optimization using two time-domain auxiliary losses (waveform and spectrogram consistency) to mitigate the teacher's trajectory bias and inherited errors. The central claims are that this yields the first pure one-step model for the task, achieves 54x faster inference than the 30-step teacher, surpasses the teacher in performance, and reaches SOTA results on VoiceBank-DEMAND while generalizing to out-of-domain and real-world data.

Significance. If the auxiliary-loss correction mechanism is validated, the result would be significant for real-time speech enhancement, as it demonstrates that a distilled one-step model can exceed its multi-step diffusion teacher in quality while providing substantial speed-up. The empirical focus on generalization to real recordings and out-of-domain sets strengthens the practical relevance; the work also supplies a concrete engineering recipe (randomized trajectory plus auxiliary terms) that could be tested in other diffusion-based audio tasks.

major comments (2)

[§3.2] §3.2 (Method): The claim that randomized trajectory sampling plus the two auxiliary losses allows the student to 'recover from teacher-induced errors and surpass the teacher' is load-bearing for the central contribution, yet the manuscript provides no ablation that isolates these components against a pure consistency-distillation baseline (i.e., without auxiliary losses) on the same teacher.
[§4.2] §4.2 (Experiments, VoiceBank-DEMAND results): Superiority to the 30-step teacher is reported, but no PESQ/STOI scores or listening-test results are given for clean (or near-clean) utterances from the same dataset; without this check it remains possible that the auxiliary losses act as an additional denoiser rather than correcting trajectory bias without introducing new artifacts.

minor comments (2)

[Abstract] The abstract states '54 times faster inference speed' but the main text should explicitly report the measured wall-clock latency (including any overhead from auxiliary loss computation at inference) rather than relying solely on step count.
[§3] Notation for the consistency function and the auxiliary loss weights is introduced without a consolidated table; adding such a table would improve readability of the joint optimization objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (Method): The claim that randomized trajectory sampling plus the two auxiliary losses allows the student to 'recover from teacher-induced errors and surpass the teacher' is load-bearing for the central contribution, yet the manuscript provides no ablation that isolates these components against a pure consistency-distillation baseline (i.e., without auxiliary losses) on the same teacher.

Authors: We agree that an ablation isolating the randomized trajectory sampling and the two auxiliary losses against a pure consistency-distillation baseline (without auxiliary losses) on the identical teacher would strengthen the validation of the central claim. In the revised manuscript we will add this ablation study, reporting PESQ, STOI, and perceptual metrics for the baseline, the randomized-trajectory-only variant, the auxiliary-loss-only variant, and the full ROSE-CD model. revision: yes
Referee: [§4.2] §4.2 (Experiments, VoiceBank-DEMAND results): Superiority to the 30-step teacher is reported, but no PESQ/STOI scores or listening-test results are given for clean (or near-clean) utterances from the same dataset; without this check it remains possible that the auxiliary losses act as an additional denoiser rather than correcting trajectory bias without introducing new artifacts.

Authors: We acknowledge that scores on clean or near-clean utterances are needed to confirm the auxiliary losses correct trajectory bias rather than simply providing extra denoising. In the revision we will report PESQ and STOI on the clean reference utterances from VoiceBank-DEMAND for both the teacher and the student, together with a small-scale listening test on these clean samples to verify that no new artifacts are introduced. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical proposal with experimental validation

full rationale

The paper describes an engineering method (ROSE-CD) that adds randomized trajectory sampling and two time-domain auxiliary losses to a consistency-distillation pipeline for speech enhancement. All performance claims (54x speedup, SOTA metrics on VoiceBank-DEMAND, out-of-domain generalization) are presented as outcomes of training and testing on external datasets rather than as quantities derived by construction from fitted parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are shown that reduce the central result to the inputs by definition. The work is therefore self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The method appears to rest on standard supervised training assumptions for diffusion and consistency models.

pith-pipeline@v0.9.0 · 5751 in / 1025 out tokens · 44174 ms · 2026-05-22T00:05:02.995555+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we introduce a randomized learning trajectory to improve the model's robustness to noise. Furthermore, we jointly optimize the one-step model with two time-domain auxiliary losses
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

Multi-channel Speech Enhancement in a Car Environment Using Wiener Filtering and Spectral Subtraction,

J. Meyer and K. U. Simmer, “Multi-channel Speech Enhancement in a Car Environment Using Wiener Filtering and Spectral Subtraction,” in 1997 IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol. 2. IEEE, 1997, pp. 1167–1170

work page 1997
[2]

Benesty, I

J. Benesty, I. Cohen, and J. Chen, Fundamentals of Signal Enhancement and Array Signal Processing . John Wiley & Sons, 2017

work page 2017
[3]

An Effective MVDR Post- Processing Method for Low-Latency Convolutive Blind Source Sep- aration,

J. Chua, L. F. Yan, and W. B. Kleijn, “An Effective MVDR Post- Processing Method for Low-Latency Convolutive Blind Source Sep- aration,” in 2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2024, pp. 130–134

work page 2024
[4]

Neural Optimisation of Fixed Beamformers With Flexible Geometric Constraints,

L. F. Yan, W. Huang, T. D. Abhayapala, J. Feng, and W. B. Kleijn, “Neural Optimisation of Fixed Beamformers With Flexible Geometric Constraints,” IEEE Transactions on Audio, Speech and Language Processing , 2025

work page 2025
[5]

HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features,

J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) . IEEE, 2021, pp. 166–170

work page 2021
[6]

SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement,

M. Strauss, N. Pia, N. K. Rao, and B. Edler, “SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement,” in 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) . IEEE, 2023, pp. 1–5

work page 2023
[7]

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement,

S.-W. Fu, C. Yu, T.-A. Hsieh, P. Plantinga, M. Ravanelli, X. Lu, and Y . Tsao, “MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement,” in Interspeech 2021 , 2021, pp. 201–205

work page 2021
[8]

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,

Y . Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM transactions on audio, speech, and language processing , vol. 27, no. 8, pp. 1256–1266, 2019

work page 2019
[9]

Conditional Diffusion Probabilistic Model for Speech Enhancement,

Y .-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y . Tsao, “Conditional Diffusion Probabilistic Model for Speech Enhancement,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7402–7406

work page 2022
[10]

Investigating Training Objectives for Generative Speech Enhancement,

J. Richter, D. de Oliveira, and T. Gerkmann, “Investigating Training Objectives for Generative Speech Enhancement,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , 2025

work page 2025
[11]

Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,

J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023

work page 2023
[12]

Schr ¨odinger Bridge for Generative Speech Enhancement,

A. Juki ´c, R. Korostik, J. Balam, and B. Ginsburg, “Schr ¨odinger Bridge for Generative Speech Enhancement,” in Interspeech 2024 , 2024, pp. 1175–1179

work page 2024
[13]

Single and few-step diffusion for generative speech enhancement,

B. Lay, J.-M. Lermercier, J. Richter, and T. Gerkmann, “Single and few-step diffusion for generative speech enhancement,” in 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 626–630

work page 2024
[14]

StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation,

J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023

work page 2023
[15]

Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge,

T. Trachu, C. Piansaddhayanon, and E. Chuangsuwanich, “Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge,” in Interspeech 2024, 2024, pp. 1180–1184

work page 2024
[16]

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders,

H. Shi, K. Shimada, M. Hirano, T. Shibuya, Y . Koyama, Z. Zhong, S. Takahashi, T. Kawahara, and Y . Mitsufuji, “Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders,” in 2024 IEEE International Conference on Acoustics, Speech and Signal Process- ing (ICASSP) . IEEE, 2024, pp. 12 951–12 955

work page 2024
[17]

Karatzas and S

I. Karatzas and S. Shreve, Brownian motion and stochastic calculus . Springer Science & Business Media, 1991, vol. 113

work page 1991
[18]

Consistency Models,

Y . Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency Models,” in Proceedings of the 40th International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, pp. 32 211–32 252

work page 2023
[19]

SE-Bridge: Speech Enhancement with Consistent Brownian Bridge,

Z. Qiu, M. Fu, F. Sun, G. Altenbek, and H. Huang, “SE-Bridge: Speech Enhancement with Consistent Brownian Bridge,” 2023

work page 2023
[20]

The PESQetarian: On the Relevance of Goodhart’s Law for Speech Enhancement,

D. de Oliveira, S. Welker, J. Richter, and T. Gerkmann, “The PESQetarian: On the Relevance of Goodhart’s Law for Speech Enhancement,” in Interspeech 2024 . Kos, Greece: ISCA, Sep. 2024, pp. 3854–3858

work page 2024
[21]

Generative Modeling by Estimating Gradients of the Data Distribution,

Y . Song and S. Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution,” Advances in neural information processing systems, vol. 32, 2019

work page 2019
[22]

Score-Based Generative Modeling through Stochastic Differen- tial Equations,

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differen- tial Equations,” in International Conference on Learning Representations , 2021

work page 2021
[23]

Estimation of Non-Normalized Statistical Models by Score Matching,

A. Hyv ¨arinen and P. Dayan, “Estimation of Non-Normalized Statistical Models by Score Matching,” Journal of Machine Learning Research , vol. 6, no. 4, 2005

work page 2005
[24]

A Connection Between Score Matching and Denoising Autoencoders,

P. Vincent, “A Connection Between Score Matching and Denoising Autoencoders,” Neural computation, vol. 23, no. 7, pp. 1661–1674, 2011

work page 2011
[25]

Elucidating the Design Space of Diffusion-Based Generative Models,

T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the Design Space of Diffusion-Based Generative Models,” Advances in neural information processing systems, vol. 35, pp. 26 565–26 577, 2022

work page 2022
[26]

A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality,

J. M. Martin-Do ˜nas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, “A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality,” IEEE Signal Processing Letters , vol. 25, no. 11, pp. 1680–1684, 2018

work page 2018
[27]

End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization,

J. Kim, M. El-Khamy, and J. Lee, “End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization,” arXiv preprint arXiv:1901.09146 , 2019

work page arXiv 1901
[28]

SDR – Half-baked or Well Done?

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2019, pp. 626–630

work page 2019
[29]

Investigating RNN-based speech enhancement methods for noise-robust Text-to- Speech,

C. V . Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to- Speech,” in 9th ISCA speech synthesis workshop , 2016, pp. 159–165

work page 2016
[30]

The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,

J. Thiemann, N. Ito, and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics , vol. 19, no. 1. AIP Publishing, 2013

work page 2013
[31]

TIMIT Acoustic-Phonetic Continuous Speech Corpus,

J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V . Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” 1993, accessed via LDC

work page 1993
[32]

Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,

A. Varga and H. J. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech communication, vol. 12, no. 3, pp. 247–251, 1993

work page 1993
[33]

The Interspeech 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,

C. K. A. Reddy, V . Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The Interspeech 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,” in Interspeech 2020 , 2020, pp. 2492–2496

work page 2020
[34]

Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE interna- tional conference on acoustics, speech, and signal processing (ICASSP) , vol. 2. IEEE, 2001, pp. 749–752

work page 2001
[35]

An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,

J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 24, no. 11, pp. 2009– 2022, 2016

work page 2009
[36]

HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement,

P. Andreev, A. Alanov, O. Ivanov, and D. Vetrov, “HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement,” in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

work page 2023
[37]

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Advances in neural information processing systems , vol. 33, pp. 12 449– 12 460, 2020

work page 2020
[38]

DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppres- sors,

C. K. A. Reddy, V . Gopal, and R. Cutler, “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppres- sors,” in 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) . IEEE, June 2021

work page 2021
[39]

DNSMOS P.835: A Non- Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,

C. K. Reddy, V . Gopal, and R. Cutler, “DNSMOS P.835: A Non- Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) . IEEE, 2022, pp. 886–890

work page 2022
[40]

Generalization ability of mos prediction networks,

E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of mos prediction networks,” in 2022 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2022, pp. 8442–8446

work page 2022

[1] [1]

Multi-channel Speech Enhancement in a Car Environment Using Wiener Filtering and Spectral Subtraction,

J. Meyer and K. U. Simmer, “Multi-channel Speech Enhancement in a Car Environment Using Wiener Filtering and Spectral Subtraction,” in 1997 IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol. 2. IEEE, 1997, pp. 1167–1170

work page 1997

[2] [2]

Benesty, I

J. Benesty, I. Cohen, and J. Chen, Fundamentals of Signal Enhancement and Array Signal Processing . John Wiley & Sons, 2017

work page 2017

[3] [3]

An Effective MVDR Post- Processing Method for Low-Latency Convolutive Blind Source Sep- aration,

J. Chua, L. F. Yan, and W. B. Kleijn, “An Effective MVDR Post- Processing Method for Low-Latency Convolutive Blind Source Sep- aration,” in 2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2024, pp. 130–134

work page 2024

[4] [4]

Neural Optimisation of Fixed Beamformers With Flexible Geometric Constraints,

L. F. Yan, W. Huang, T. D. Abhayapala, J. Feng, and W. B. Kleijn, “Neural Optimisation of Fixed Beamformers With Flexible Geometric Constraints,” IEEE Transactions on Audio, Speech and Language Processing , 2025

work page 2025

[5] [5]

HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features,

J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) . IEEE, 2021, pp. 166–170

work page 2021

[6] [6]

SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement,

M. Strauss, N. Pia, N. K. Rao, and B. Edler, “SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement,” in 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) . IEEE, 2023, pp. 1–5

work page 2023

[7] [7]

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement,

S.-W. Fu, C. Yu, T.-A. Hsieh, P. Plantinga, M. Ravanelli, X. Lu, and Y . Tsao, “MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement,” in Interspeech 2021 , 2021, pp. 201–205

work page 2021

[8] [8]

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,

Y . Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM transactions on audio, speech, and language processing , vol. 27, no. 8, pp. 1256–1266, 2019

work page 2019

[9] [9]

Conditional Diffusion Probabilistic Model for Speech Enhancement,

Y .-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y . Tsao, “Conditional Diffusion Probabilistic Model for Speech Enhancement,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7402–7406

work page 2022

[10] [10]

Investigating Training Objectives for Generative Speech Enhancement,

J. Richter, D. de Oliveira, and T. Gerkmann, “Investigating Training Objectives for Generative Speech Enhancement,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , 2025

work page 2025

[11] [11]

Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,

J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023

work page 2023

[12] [12]

Schr ¨odinger Bridge for Generative Speech Enhancement,

A. Juki ´c, R. Korostik, J. Balam, and B. Ginsburg, “Schr ¨odinger Bridge for Generative Speech Enhancement,” in Interspeech 2024 , 2024, pp. 1175–1179

work page 2024

[13] [13]

Single and few-step diffusion for generative speech enhancement,

B. Lay, J.-M. Lermercier, J. Richter, and T. Gerkmann, “Single and few-step diffusion for generative speech enhancement,” in 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 626–630

work page 2024

[14] [14]

StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation,

J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023

work page 2023

[15] [15]

Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge,

T. Trachu, C. Piansaddhayanon, and E. Chuangsuwanich, “Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge,” in Interspeech 2024, 2024, pp. 1180–1184

work page 2024

[16] [16]

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders,

H. Shi, K. Shimada, M. Hirano, T. Shibuya, Y . Koyama, Z. Zhong, S. Takahashi, T. Kawahara, and Y . Mitsufuji, “Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders,” in 2024 IEEE International Conference on Acoustics, Speech and Signal Process- ing (ICASSP) . IEEE, 2024, pp. 12 951–12 955

work page 2024

[17] [17]

Karatzas and S

I. Karatzas and S. Shreve, Brownian motion and stochastic calculus . Springer Science & Business Media, 1991, vol. 113

work page 1991

[18] [18]

Consistency Models,

Y . Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency Models,” in Proceedings of the 40th International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, pp. 32 211–32 252

work page 2023

[19] [19]

SE-Bridge: Speech Enhancement with Consistent Brownian Bridge,

Z. Qiu, M. Fu, F. Sun, G. Altenbek, and H. Huang, “SE-Bridge: Speech Enhancement with Consistent Brownian Bridge,” 2023

work page 2023

[20] [20]

The PESQetarian: On the Relevance of Goodhart’s Law for Speech Enhancement,

D. de Oliveira, S. Welker, J. Richter, and T. Gerkmann, “The PESQetarian: On the Relevance of Goodhart’s Law for Speech Enhancement,” in Interspeech 2024 . Kos, Greece: ISCA, Sep. 2024, pp. 3854–3858

work page 2024

[21] [21]

Generative Modeling by Estimating Gradients of the Data Distribution,

Y . Song and S. Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution,” Advances in neural information processing systems, vol. 32, 2019

work page 2019

[22] [22]

Score-Based Generative Modeling through Stochastic Differen- tial Equations,

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differen- tial Equations,” in International Conference on Learning Representations , 2021

work page 2021

[23] [23]

Estimation of Non-Normalized Statistical Models by Score Matching,

A. Hyv ¨arinen and P. Dayan, “Estimation of Non-Normalized Statistical Models by Score Matching,” Journal of Machine Learning Research , vol. 6, no. 4, 2005

work page 2005

[24] [24]

A Connection Between Score Matching and Denoising Autoencoders,

P. Vincent, “A Connection Between Score Matching and Denoising Autoencoders,” Neural computation, vol. 23, no. 7, pp. 1661–1674, 2011

work page 2011

[25] [25]

Elucidating the Design Space of Diffusion-Based Generative Models,

T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the Design Space of Diffusion-Based Generative Models,” Advances in neural information processing systems, vol. 35, pp. 26 565–26 577, 2022

work page 2022

[26] [26]

A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality,

J. M. Martin-Do ˜nas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, “A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality,” IEEE Signal Processing Letters , vol. 25, no. 11, pp. 1680–1684, 2018

work page 2018

[27] [27]

End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization,

J. Kim, M. El-Khamy, and J. Lee, “End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization,” arXiv preprint arXiv:1901.09146 , 2019

work page arXiv 1901

[28] [28]

SDR – Half-baked or Well Done?

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2019, pp. 626–630

work page 2019

[29] [29]

Investigating RNN-based speech enhancement methods for noise-robust Text-to- Speech,

C. V . Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to- Speech,” in 9th ISCA speech synthesis workshop , 2016, pp. 159–165

work page 2016

[30] [30]

The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,

J. Thiemann, N. Ito, and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics , vol. 19, no. 1. AIP Publishing, 2013

work page 2013

[31] [31]

TIMIT Acoustic-Phonetic Continuous Speech Corpus,

J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V . Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” 1993, accessed via LDC

work page 1993

[32] [32]

Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,

A. Varga and H. J. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech communication, vol. 12, no. 3, pp. 247–251, 1993

work page 1993

[33] [33]

The Interspeech 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,

C. K. A. Reddy, V . Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The Interspeech 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,” in Interspeech 2020 , 2020, pp. 2492–2496

work page 2020

[34] [34]

Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE interna- tional conference on acoustics, speech, and signal processing (ICASSP) , vol. 2. IEEE, 2001, pp. 749–752

work page 2001

[35] [35]

An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,

J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 24, no. 11, pp. 2009– 2022, 2016

work page 2009

[36] [36]

HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement,

P. Andreev, A. Alanov, O. Ivanov, and D. Vetrov, “HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement,” in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

work page 2023

[37] [37]

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Advances in neural information processing systems , vol. 33, pp. 12 449– 12 460, 2020

work page 2020

[38] [38]

DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppres- sors,

C. K. A. Reddy, V . Gopal, and R. Cutler, “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppres- sors,” in 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) . IEEE, June 2021

work page 2021

[39] [39]

DNSMOS P.835: A Non- Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,

C. K. Reddy, V . Gopal, and R. Cutler, “DNSMOS P.835: A Non- Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) . IEEE, 2022, pp. 886–890

work page 2022

[40] [40]

Generalization ability of mos prediction networks,

E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of mos prediction networks,” in 2022 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2022, pp. 8442–8446

work page 2022