Speech Enhancement Based on Drifting Models

Diego Caviedes-Nozal; Liang Xu; Longfei Felix Yan; Rasmus Kongsgaard Olsson; W. Bastiaan Kleijn

arXiv: 2604.24199 · v3 · pith:R2WGVD74new · submitted 2026-04-27 · cs.SD · cs.AI · eess.AS · eess.SP

Pith reviewed 2026-05-21 08:54 UTC · model grok-4.3

classification: cs.SD · cs.AI · eess.AS · eess.SP

keywords: speech enhancement · generative models · one-step inference · drifting field · unpaired training · distribution matching · VoiceBank-DEMAND

The pith

DriftSE performs high-fidelity speech enhancement in one forward pass by learning a drifting field that shifts noisy inputs to match the clean speech distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DriftSE as a generative approach that reframes speech denoising as reaching an equilibrium state instead of running repeated correction steps. A learned drifting field acts as a correction vector that moves the distribution of processed noisy audio directly toward the areas of high probability in clean speech. Because the method matches entire distributions rather than individual paired examples, it can train on separate collections of noisy and clean recordings. On standard benchmarks the single-step outputs exceed the quality of diffusion-based systems that require many iterations. If this holds, real-time enhancement systems could run faster while using less paired training data.

Core claim

DriftSE formulates denoising as an equilibrium problem in which the pushforward distribution of a mapping function is evolved by a drifting field until it coincides with the clean speech distribution. The framework admits both a direct mapping from the noisy observation and a stochastic conditional model starting from a Gaussian prior. This construction yields one-step inference and permits training without paired noisy-clean samples because only distributional alignment is required.
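
For orientation, the equilibrium condition can be written compactly. The notation below is assembled from passages quoted elsewhere on this page (a source distribution p_ε pushed through a mapping f_θ, and the field V_{p,q}); it is a reading aid under those assumptions and not a reproduction of the paper's numbered equations.

```latex
\begin{align}
  q_\theta &= (f_\theta)_{\#}\, p_\epsilon,
    \qquad x = f_\theta(\epsilon),\ \epsilon \sim p_\epsilon, \\
  V_{p,q}(x) &= V_p^{+}(x) - V_q^{-}(x), \\
  q_\theta = p_{\text{data}} &\;\Longrightarrow\; V_{p,q}(x) = 0 .
\end{align}
```

Read this way, training moves q_θ until the attraction toward the clean distribution and the repulsion from the current outputs cancel, and inference is a single evaluation of f_θ.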

What carries the argument

The Drifting Field, a learned correction vector that steers the distribution of mapped noisy samples toward high-density regions of clean speech.
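
A minimal NumPy sketch of one way such a field could be estimated from samples, assuming a Gaussian kernel and the attraction-minus-repulsion form V = V⁺ − V⁻ that Pith quotes from the paper further down this page. The function names, the bandwidth, and the 2-D toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def kernel_mean(x, samples, bandwidth):
    """Kernel-weighted mean of `samples` as seen from each query point in `x`."""
    # Squared Euclidean distances, shape (num_queries, num_samples).
    d2 = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / (2.0 * bandwidth ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ samples


def drifting_field(x, clean, generated, bandwidth=1.0):
    """V(x) = V+(x) - V-(x): mean shift toward clean minus mean shift toward generated."""
    v_plus = kernel_mean(x, clean, bandwidth) - x       # attraction (assumed form)
    v_minus = kernel_mean(x, generated, bandwidth) - x  # repulsion (assumed form)
    return v_plus - v_minus


# Synthetic toy data standing in for clean and generated feature vectors.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(256, 2))
generated = rng.normal(3.0, 1.0, size=(256, 2))

drift_far = drifting_field(generated, clean, generated)  # outputs far from clean
drift_eq = drifting_field(clean, clean, clean)           # outputs equal to clean

print("mean drift away from equilibrium:", drift_far.mean(axis=0))
print("largest drift magnitude at equilibrium:", np.abs(drift_eq).max())
```

When the generated samples coincide with the clean samples, the two terms cancel and the field is zero, which is the equilibrium the summary above describes.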

If this is right

Speech enhancement becomes feasible in a single network evaluation rather than repeated sampling passes.
Training can proceed with unpaired noisy recordings and separate clean recordings by aligning their distributions.
The same equilibrium formulation may apply to other audio restoration tasks that currently rely on iterative generative models.
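
The unpaired-training point can be illustrated on a toy problem. The sketch below assumes a Gaussian-kernel field and a one-parameter model f(y) = y + bias, and nudges the outputs along the drift until the generated set lines up with an independently drawn clean set. The update rule (regressing onto a stop-gradient target f(y) + V) is an assumption made for illustration, and the 1-D data are synthetic.

```python
import numpy as np


def kernel_mean(x, samples, bandwidth):
    """1-D kernel-weighted mean of `samples` seen from each point in `x`."""
    w = np.exp(-((x[:, None] - samples[None, :]) ** 2) / (2.0 * bandwidth ** 2))
    return (w * samples[None, :]).sum(axis=1) / w.sum(axis=1)


rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=512)  # unpaired "clean" set (synthetic)
noisy = rng.normal(3.0, 1.0, size=512)  # independent "noisy" set, no pairing

bias = 0.0        # the whole toy model: f(y) = y + bias
step_size = 0.5   # hypothetical learning rate
bandwidth = 2.0   # hypothetical kernel width

for _ in range(100):
    generated = noisy + bias
    drift = (kernel_mean(generated, clean, bandwidth)
             - kernel_mean(generated, generated, bandwidth))
    # Gradient step on 0.5 * mean((f(y) - stopgrad(f(y) + drift)) ** 2) w.r.t. bias.
    bias += step_size * drift.mean()

gap_before = abs(noisy.mean() - clean.mean())
gap_after = abs((noisy + bias).mean() - clean.mean())
print("mean gap before training:", gap_before)
print("mean gap after training:", gap_after)
```

No sample in `noisy` is ever matched to a sample in `clean`; only the two sets are compared. This toy model can only shift the mean, so it says nothing about whether input-specific content survives.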

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the drifting field generalizes across recording conditions, the method could support on-device enhancement that adapts to new noise environments without fresh paired data collection.
Similar distribution-matching corrections might replace iterative sampling in related one-dimensional signal tasks such as music source separation.

Load-bearing premise

A correction vector can be trained so that one application moves the entire distribution of noisy speech directly onto the clean speech distribution without paired examples or repeated adjustments.

What would settle it

On the VoiceBank-DEMAND test set, single-step DriftSE outputs that score lower than multi-step diffusion baselines on standard perceptual metrics such as PESQ or STOI would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 2604.24199 by Diego Caviedes-Nozal, Liang Xu, Longfei Felix Yan, Rasmus Kongsgaard Olsson, W. Bastiaan Kleijn.

**Figure 1.** Overview of the DriftSE framework (illustrating the Direct Mapping formulation). … force pulling z_i toward the clean distribution Z⁺ and a repulsion force pushing it away from the current generated distribution Z⁻, driving f_θ toward equilibrium. Training Objective: To capture hierarchical speech structures, the base drifting loss from (9) is computed and aggregated across multiple layers l ∈ S of the l…

**Figure 2.** Evolution of frame-level distributions in the DistilHuBERT semantic space for a fixed test utterance. Each panel displays 2D density contours (PCA projection) derived from all frames across different training epochs. Stars denote the corresponding centroids, which represent the mean of all projected frames. As training progresses, the generated distribution shifts from the noisy distribution toward the cle…

Original abstract

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript proposes DriftSE, a generative framework for speech enhancement formulated as an equilibrium problem. A learned Drifting Field evolves the pushforward distribution of a mapping function to match the clean speech distribution, enabling native one-step inference and unpaired training via distribution matching rather than paired regression. Two formulations are considered: direct mapping from the noisy observation and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark report that DriftSE outperforms multi-step diffusion baselines while achieving high-fidelity enhancement.

Significance. If the central claims are substantiated, this work would offer a meaningful efficiency advance for generative speech enhancement by replacing iterative sampling with a single forward pass. The explicit support for unpaired training through distribution matching is a clear strength that could extend applicability in data-scarce settings. The manuscript is credited for introducing a new equilibrium-based paradigm and for attempting to derive one-step superiority directly from the drifting-field construction.

major comments (1)

[§3 (Drifting Field and Equilibrium Formulation)] The central claim that the drifting field evolves the pushforward of the mapping function to exactly match the clean distribution in one step, while preserving input-specific content, carries a correctness-risk concern. The equilibrium condition ensures only marginal distribution matching; without explicit content-preservation terms (e.g., phonetic or speaker-identity constraints) or paired-sample supervision, the learned mapping could converge to any high-density clean sample. A concrete test is required: report ASR word-error-rate or speaker-similarity metrics on the enhanced outputs versus paired diffusion baselines to verify that intelligibility and speaker identity are retained rather than traded for marginal fidelity.
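
The word-error-rate check requested here is easy to state precisely. The sketch below computes WER as the word-level edit distance divided by the number of reference words; the transcripts are hypothetical placeholders, and the ASR system that would produce them is not specified on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Two-row dynamic programme for the Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(substitution, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, for illustration only.
reference = "please call stella and ask her to bring these things"
transcript_a = "please call stella and ask her to bring these things"
transcript_b = "please call stella and ask her to bring this thing"

print("WER, transcript A:", word_error_rate(reference, transcript_a))
print("WER, transcript B:", word_error_rate(reference, transcript_b))
```

A system that matches the clean distribution while altering words would score well on quality metrics yet show a higher WER against the reference text, which is the failure the referee wants ruled out.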

minor comments (3)

[Experiments] The abstract states outperformance on VoiceBank-DEMAND but the main text should include the full set of quantitative results (PESQ, STOI, etc.) with error bars and statistical significance tests against the cited diffusion baselines.
[§3] Clarify the precise difference in the loss functions and sampling procedures between the direct-mapping and stochastic-conditional formulations; a side-by-side equation comparison would improve readability.
[Discussion] Add a limitations paragraph discussing potential failure modes such as content hallucination or sensitivity to the choice of clean-speech prior.
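
One common way to supply the error bars requested in the first minor comment is a paired bootstrap over per-utterance scores. The sketch below is a generic percentile bootstrap, offered as an assumption about how such a comparison could be run; the scores are synthetic placeholders, not results from the paper.

```python
import numpy as np


def paired_bootstrap_ci(scores_a, scores_b, num_resamples=10_000, alpha=0.05, seed=0):
    """Mean per-utterance difference (a - b) with a percentile bootstrap interval."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample utterances with replacement; pairing is preserved via the differences.
    idx = rng.integers(0, diff.size, size=(num_resamples, diff.size))
    means = diff[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), low, high


# Synthetic placeholder scores for two systems on the same 200 utterances.
rng = np.random.default_rng(1)
system_b = rng.normal(2.9, 0.4, size=200)
system_a = system_b + rng.normal(0.1, 0.2, size=200)

mean_diff, low, high = paired_bootstrap_ci(system_a, system_b)
print("mean difference:", mean_diff)
print("95% interval:", (low, high))
```

An interval that excludes zero indicates a difference that is unlikely to be a sampling artifact of the test set; it does not account for variation across training seeds.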

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. The concern regarding content preservation under the equilibrium formulation is well-taken, and we address it directly below. We believe incorporating the suggested evaluations will strengthen the empirical support for our claims.

Point-by-point responses

Referee: [§3 (Drifting Field and Equilibrium Formulation)] The central claim that the drifting field evolves the pushforward of the mapping function to exactly match the clean distribution in one step, while preserving input-specific content, carries a correctness-risk concern. The equilibrium condition ensures only marginal distribution matching; without explicit content-preservation terms (e.g., phonetic or speaker-identity constraints) or paired-sample supervision, the learned mapping could converge to any high-density clean sample. A concrete test is required: report ASR word-error-rate or speaker-similarity metrics on the enhanced outputs versus paired diffusion baselines to verify that intelligibility and speaker identity are retained rather than traded for marginal fidelity.

Authors: We agree that the equilibrium condition enforces marginal matching and that, without additional constraints, there is a theoretical risk of the mapping converging to any high-density clean sample rather than preserving input-specific content. In the DriftSE formulation the drifting field is explicitly conditioned on the noisy observation, which we posit supplies an implicit content-preserving mechanism by evolving the input-specific pushforward rather than sampling from an unconditional clean prior. Nevertheless, this remains an implicit argument. To directly substantiate the claim and address the referee's request, we will add new experiments in the revised manuscript reporting ASR word-error rates (using a standard pre-trained ASR system) and speaker-similarity metrics (cosine similarity of speaker embeddings) on the enhanced outputs, comparing DriftSE against the multi-step diffusion baselines on VoiceBank-DEMAND. These results will be included in a new subsection of the experiments.

Revision: yes
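
The speaker-similarity metric named in this response reduces to a cosine similarity between two embedding vectors. The sketch below shows only that final step; the embeddings are random placeholders, the 192-dimensional size is arbitrary, and the speaker-embedding model that would produce real vectors is an open choice not specified here.

```python
import numpy as np


def speaker_similarity(emb_a, emb_b, eps=1e-12):
    """Cosine similarity between two speaker-embedding vectors."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / max(np.linalg.norm(a) * np.linalg.norm(b), eps))


# Random placeholder vectors standing in for real speaker embeddings.
rng = np.random.default_rng(0)
clean_emb = rng.normal(size=192)                        # clean reference utterance
enhanced_emb = clean_emb + 0.1 * rng.normal(size=192)   # enhanced, same speaker
other_emb = rng.normal(size=192)                        # unrelated speaker

print("clean vs enhanced:", speaker_similarity(clean_emb, enhanced_emb))
print("clean vs other speaker:", speaker_similarity(clean_emb, other_emb))
```

In an actual evaluation the score would be averaged over the test set and compared between DriftSE and the baselines, with the clean reference as the anchor in both cases.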

Circularity Check

0 steps flagged

No significant circularity detected in DriftSE derivation chain

Full rationale

The paper introduces DriftSE as a generative framework that formulates denoising as an equilibrium problem solved via a learned drifting field evolving the pushforward distribution to match the clean speech distribution in one step. This is presented as a modeling choice enabling unpaired training and single-step inference, with empirical validation on VoiceBank-DEMAND. No quoted equations or sections demonstrate a self-definitional reduction where the claimed one-step result equals the input by construction, nor any fitted parameter renamed as prediction, load-bearing self-citation, or ansatz smuggled via prior work. The central premise relies on the field's learned behavior to reach high-density regions, which remains an independent empirical claim rather than a tautology. The derivation is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that a learned correction vector exists that can evolve any pushforward to the clean distribution; no free parameters or invented entities beyond the drifting field are detailed in the abstract.

axioms (1)

domain assumption: The clean speech distribution can be matched by evolving a mapping function's pushforward via a drifting field without paired data.
Invoked in the formulation of denoising as an equilibrium problem.

invented entities (1)

Drifting Field (no independent evidence)
Purpose: Learned correction vector guiding samples to high-density clean speech regions.
Central new component enabling one-step inference and unpaired training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Drifting Field V_{p,q}(x) = V_p^+(x) - V_q^-(x) ... kernel-weighted mean shift ... equilibrium where the drift vanishes q_θ = p_data ⟹ V_{p,q}(x) = 0

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: evolving the pushforward distribution of a mapping function to directly match the clean speech distribution ... one-step inference

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

[1] Introduction. The field of Speech Enhancement (SE) has evolved significantly over recent decades, progressing from classical statistical signal processing techniques like Wiener filtering [1, 2] to modern deep learning. Discriminative models, such as RNNs [3], LSTMs [4], and complex spectral mapping [5], effectively suppress noise but often yield spe...
[2] Drifting Models. We briefly review Drifting Models [23], which formulate generative modeling as the training-time evolution of a pushforward distribution. 2.1. Pushforward and Equilibrium. Given a simple source distribution p_ϵ (e.g., standard Gaussian noise N(0, I)), the drift approach takes a sample ϵ ∼ p_ϵ with ϵ ∈ ℝ^d, and maps it through a parameterized func...
[3] Speech Enhancement via Latent Drifting. We propose DriftSE, which formulates speech enhancement as an equilibrium problem (Fig. 1). By evolving the mapping function’s pushforward distribution to match the clean speech distribution, DriftSE achieves native one-step denoising (1 NFE). 3.1. Two Enhancement Paradigms. Let y ∈ ℂ^{F×T} denote the complex spectrogr...
[4] Experiments. In this section, we evaluate DriftSE against state-of-the-art iterative and one-step baselines, and perform ablation studies to analyze the contribution of each design choice. 4.1. Experimental Setup. Datasets: We train on clean speech from the VoiceBank corpus [28] and noise recordings from the DEMAND dataset [29]. During training, we empl...
[5] Conclusion. In this paper, we introduced Speech Enhancement based on Drifting Models (DriftSE), a novel paradigm that reformulates denoising as an equilibrium problem to enable native one-step generation. By utilizing a latent drifting field, DriftSE evolves the mapping function’s pushforward distribution to directly match the clean speech distribution dur...
[6] Generative AI Use Disclosure. We acknowledge the ISCA policy stating that generative AI tools cannot serve as co-authors and should only be used for editing or polishing rather than producing significant parts of this paper. Although the proposed method is a novel generative model for speech enhancement, the authors declare that no generative AI tools we...
[7] J. Meyer and K. U. Simmer, “Multi-channel Speech Enhancement in a Car Environment Using Wiener Filtering and Spectral Subtraction,” in ICASSP, vol. 2. IEEE, 1997, pp. 1167–1170
[8] J. Chua, L. F. Yan, and W. B. Kleijn, “An Effective MVDR Post-Processing Method for Low-Latency Convolutive Blind Source Separation,” in 2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2024, pp. 130–134
[9] F. Weninger, J. R. Hershey, J. L. Roux, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Latent Variable Analysis and Signal Separation (LVA/ICA), vol. 9237, Aug. 2015, pp. 91–99
[10] K. Tan and D. Wang, “A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement,” in Proc. Interspeech, 2018, pp. 3229–3233
[11] Y. Hu, Y. Liu, S. Lyu, M. Xing, S. Zhang, Y. Fu, J. Fan, and L. Xie, “DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement,” in Proc. Interspeech, 2020, pp. 2472–2476
[12] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech Enhancement Generative Adversarial Network,” in Proc. Interspeech, 2017, pp. 3642–3646
[13] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, “MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement,” in International Conference on Machine Learning. PMLR, 2019, pp. 2031–2041
[14] J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021, pp. 166–170
[15] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023
[16] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” in ICLR, 2021
[17] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023
[18] T. Trachu, C. Piansaddhayanon, and E. Chuangsuwanich, “Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge,” in Proc. Interspeech, 2024, pp. 1180–1184
[19] S. Han, S. Lee, J. Lee, and K. Lee, “Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement,” in Proc. Interspeech, 2025, pp. 2380–2384
[20] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency Models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 23–29 Jul 2023, pp. 32211–32252
[21] D. Kim, C.-H. Lai, W.-H. Liao, N. Murata, Y. Takida, T. Uesaka, Y. He, Y. Mitsufuji, and S. Ermon, “Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion,” in ICLR, 2024
[22] L. Xu, L. F. Yan, and W. B. Kleijn, “Robust One-Step Speech Enhancement via Consistency Distillation,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Tahoe City, CA, USA: IEEE, Oct. 2025
[23] S. Nishigori, K. Saito, N. Murata, M. Hirano, S. Takahashi, and Y. Mitsufuji, “Schrödinger Bridge Consistency Trajectory Models for Speech Enhancement,” in 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2025, pp. 1–5
[24] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow Matching for Generative Modeling,” in ICLR, 2023
[25] Z. Wang, Z. Liu, X. Zhu, Y. Zhu, M. Liu, J. Chen, L. Xiao, C. Weng, and L. Xie, “FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching,” in Proc. Interspeech, 2025
[26] X. Liu, C. Gong, and Q. Liu, “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow,” in ICLR, 2023
[27] Z. Geng, M. Deng, X. Bai, Z. Kolter, and K. He, “Mean Flows for One-step Generative Modeling,” in NeurIPS, 2025
[28] Z. Geng, Y. Lu, Z. Wu, E. Shechtman, J. Z. Kolter, and K. He, “Improved Mean Flows: On the Challenges of Fastforward Generative Models,” arXiv preprint arXiv:2512.02012, 2025
[29] M. Deng, H. Li, T. Li, Y. Du, and K. He, “Generative Modeling via Drifting,” arXiv preprint arXiv:2602.04770, 2026
[30] Y. Cheng, “Mean shift, mode seeking, and clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790–799, 1995
[31] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A.-r. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021
[32] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
[33] H.-J. Chang, S.-w. Yang, and H.-y. Lee, “DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT,” in ICASSP. IEEE, 2022, pp. 7087–7091
[34] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,” in 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 2016, pp. 146–152
[35] J. Thiemann, N. Ito, and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics, vol. 19, no. 1. AIP Publishing, 2013
[36] C. K. A. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The Interspeech 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,” in Proc. Interspeech, 2020, pp. 2492–2496
[37] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual Evaluation of Speech Quality (PESQ): A New Method for Speech Quality Assessment of Telephone Networks and Codecs,” in ICASSP, vol. 2. IEEE, 2001, pp. 749–752
[38] J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016
[39] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in ICASSP, 2019, pp. 626–630
[40] A. Ragano, J. Skoglund, and A. Hines, “SCOREQ: Speech Quality Assessment with Contrastive Regression,” in NeurIPS, vol. 37, 2024, pp. 105702–105729
[41] C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in ICASSP. IEEE, August 2021
[42] C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in ICASSP. IEEE, 2022, pp. 886–890
[43] P. Andreev, A. Alanov, O. Ivanov, and D. Vetrov, “HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement,” in ICASSP. IEEE, 2023, pp. 1–5
[44] D. Li, S. Lu, H. Pan, Z. Zhan, Q. Hong, and L. Li, “MeanFlowSE: one-step generative speech enhancement via conditional mean flow,” in ICASSP, 2026
[45] S.-W. Fu, C. Yu, Y. Tsao, X. Lu, and H. Kawahara, “MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement,” in Proc. Interspeech, 2021, pp. 201–205
[46] R. Scheibler, Y. Fujita, Y. Shirahata, and T. Komatsu, “UNIVERSE++: Universal Score-based Speech Enhancement with High Content Preservation,” in Proc. Interspeech, 2024

[1] [1]

Introduction The field of Speech Enhancement (SE) has evolved significantly over recent decades, progressing from classical statistical sig- nal processing techniques like Wiener filtering [1, 2] to mod- ern deep learning. Discriminative models, such as RNNs [3], LSTMs [4], and complex spectral mapping [5], effectively sup- press noise but often yield spe...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[2] [2]

Drifting Models We briefly review Drifting Models [23], which formulate gener- ative modeling as the training-time evolution of a pushforward distribution. 2.1. Pushforward and Equilibrium Given a simple source distributionp ϵ (e.g., standard Gaussian noiseN(0,I)), the drift approach takes a sampleϵ∼p ϵ with ϵ∈R d, and maps it through a parameterized func...

work page

[3] [3]

Speech Enhancement via Latent Drifting We propose DriftSE, which formulates speech enhancement as an equilibrium problem (Fig. 1). By evolving the mapping func- tion’s pushforward distribution to match the clean speech distri- bution, DriftSE achieves native one-step denoising (1 NFE). 3.1. Two Enhancement Paradigms Lety∈C F×T denote the complex spectrogr...

work page

[4] [4]

unpaired

Experiments In this section, we evaluate DriftSE against state-of-the-art it- erative and one-step baselines, and perform ablation studies to analyze the contribution of each design choice. 4.1. Experimental Setup Datasets:We train on clean speech from the V oiceBank cor- pus [28] and noise recordings from the DEMAND dataset [29]. During training, we empl...

work page 2020

[5] [5]

By utilizing a latent drifting field, DriftSE evolves the mapping function’s pushforward distribution to directly match the clean speech distribution during training

Conclusion In this paper, we introduced Speech Enhancement based on Drifting Models (DriftSE), a novel paradigm that reformulates denoising as an equilibrium problem to enable native one-step generation. By utilizing a latent drifting field, DriftSE evolves the mapping function’s pushforward distribution to directly match the clean speech distribution dur...

work page

[6] [6]

Generative AI Use Disclosure We acknowledge the ISCA policy stating that generative AI tools cannot serve as co-authors and should only be used for editing or polishing rather than producing significant parts of this paper. Although the proposed method is a novel generative model for speech enhancement, the authors declare that no gen- erative AI tools we...

work page

[7] [7]

Multi-channel Speech Enhancement in a Car Environment Using Wiener Filtering and Spectral Sub- traction,

J. Meyer and K. U. Simmer, “Multi-channel Speech Enhancement in a Car Environment Using Wiener Filtering and Spectral Sub- traction,” inICASSP, vol. 2. IEEE, 1997, pp. 1167–1170

work page 1997

[8] [8]

J. Chua, L. F. Yan, and W. B. Kleijn, “An Effective MVDR Post-Processing Method for Low-Latency Convolutive Blind Source Separation,” in 2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2024, pp. 130–134.

[9] [9]

F. Weninger, J. R. Hershey, J. L. Roux, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Latent Variable Analysis and Signal Separation (LVA/ICA), vol. 9237, Aug. 2015, pp. 91–99.

[10] [10]

K. Tan and D. Wang, “A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement,” in Proc. Interspeech, 2018, pp. 3229–3233.

[11] [11]

Y. Hu, Y. Liu, S. Lyu, M. Xing, S. Zhang, Y. Fu, J. Fan, and L. Xie, “DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement,” in Proc. Interspeech, 2020, pp. 2472–2476.

[12] [12]

S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech Enhancement Generative Adversarial Network,” in Proc. Interspeech, 2017, pp. 3642–3646.

[13] [13]

S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, “MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement,” in International Conference on Machine Learning. PMLR, 2019, pp. 2031–2041.

[14] [14]

J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021, pp. 166–170.

[15] [15]

J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.

[16] [16]

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” in ICLR, 2021.

[17] [17]

J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023.

[18] [18]

T. Trachu, C. Piansaddhayanon, and E. Chuangsuwanich, “Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge,” in Proc. Interspeech, 2024, pp. 1180–1184.

[19] [19]

S. Han, S. Lee, J. Lee, and K. Lee, “Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement,” in Proc. Interspeech, 2025, pp. 2380–2384.

[20] [20]

Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency Models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 23–29 Jul 2023, pp. 32211–32252.

[21] [21]

D. Kim, C.-H. Lai, W.-H. Liao, N. Murata, Y. Takida, T. Uesaka, Y. He, Y. Mitsufuji, and S. Ermon, “Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion,” in ICLR, 2024.

[22] [22]

L. Xu, L. F. Yan, and W. B. Kleijn, “Robust One-Step Speech Enhancement via Consistency Distillation,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Tahoe City, CA, USA: IEEE, Oct. 2025.

[23] [23]

S. Nishigori, K. Saito, N. Murata, M. Hirano, S. Takahashi, and Y. Mitsufuji, “Schrödinger Bridge Consistency Trajectory Models for Speech Enhancement,” in 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2025, pp. 1–5.

[24] [24]

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow Matching for Generative Modeling,” in ICLR, 2023.

[25] [25]

Z. Wang, Z. Liu, X. Zhu, Y. Zhu, M. Liu, J. Chen, L. Xiao, C. Weng, and L. Xie, “FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching,” in Proc. Interspeech, 2025.

[26] [26]

X. Liu, C. Gong, and Q. Liu, “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow,” in ICLR, 2023.

[27] [27]

Z. Geng, M. Deng, X. Bai, Z. Kolter, and K. He, “Mean Flows for One-step Generative Modeling,” in NeurIPS, 2025.

[28] [28]

Z. Geng, Y. Lu, Z. Wu, E. Shechtman, J. Z. Kolter, and K. He, “Improved Mean Flows: On the Challenges of Fastforward Generative Models,” arXiv preprint arXiv:2512.02012, 2025.

[29] [29]

M. Deng, H. Li, T. Li, Y. Du, and K. He, “Generative Modeling via Drifting,” arXiv preprint arXiv:2602.04770, 2026.

[30] [30]

Y. Cheng, “Mean shift, mode seeking, and clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790–799, 1995.

[31] [31]

W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A.-r. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.

[32] [32]

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.

[33] [33]

H.-J. Chang, S.-w. Yang, and H.-y. Lee, “DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT,” in ICASSP. IEEE, 2022, pp. 7087–7091.

[34] [34]

C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,” in 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 2016, pp. 146–152.

[35] [35]

J. Thiemann, N. Ito, and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics, vol. 19, no. 1. AIP Publishing, 2013.

[36] [36]

C. K. A. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The Interspeech 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,” in Proc. Interspeech, 2020, pp. 2492–2496.

[37] [37]

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual Evaluation of Speech Quality (PESQ): A New Method for Speech Quality Assessment of Telephone Networks and Codecs,” in ICASSP, vol. 2. IEEE, 2001, pp. 749–752.

[38] [38]

J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.

[39] [39]

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in ICASSP, 2019, pp. 626–630.

[40] [40]

A. Ragano, J. Skoglund, and A. Hines, “SCOREQ: Speech Quality Assessment with Contrastive Regression,” in NeurIPS, vol. 37, 2024, pp. 105702–105729.

[41] [41]

C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in ICASSP. IEEE, August 2021.

[42] [42]

C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in ICASSP. IEEE, 2022, pp. 886–890.

[43] [43]

P. Andreev, A. Alanov, O. Ivanov, and D. Vetrov, “HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement,” in ICASSP. IEEE, 2023, pp. 1–5.

[44] [44]

D. Li, S. Lu, H. Pan, Z. Zhan, Q. Hong, and L. Li, “MeanFlowSE: one-step generative speech enhancement via conditional mean flow,” in ICASSP, 2026.

[45] [45]

S.-W. Fu, C. Yu, Y. Tsao, X. Lu, and H. Kawahara, “MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement,” in Proc. Interspeech, 2021, pp. 201–205.

[46] [46]

R. Scheibler, Y. Fujita, Y. Shirahata, and T. Komatsu, “UNIVERSE++: Universal Score-based Speech Enhancement with High Content Preservation,” in Proc. Interspeech, 2024.