PhASE-Flow: Phonetic-Conditioned Acoustic Flow Matching in SSL Representation Domain for Speech Enhancement

Dahan Wang; Jing Lu; Jun Gao; Xiaobin Rong; Yu Sun

arxiv: 2606.17806 · v1 · pith:FM4QKNBPnew · submitted 2026-06-16 · 📡 eess.AS

Jun Gao, Xiaobin Rong, Yu Sun, Dahan Wang, Jing Lu

Pith reviewed 2026-06-26 23:02 UTC · model grok-4.3

classification 📡 eess.AS

keywords speech enhancement, flow matching, self-supervised learning, phonetic conditioning, acoustic representations, neural vocoder, latent space modeling

The pith

PhASE-Flow models the conditional distribution of clean acoustic representations given phonetic ones inside SSL latent space to enhance noisy speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a speech enhancement approach that moves flow matching entirely into the representation space produced by self-supervised learning models. It conditions the prediction of clean acoustic features on phonetic features drawn from the same hierarchy and recovers the waveform with a neural vocoder. A reader would care if this yields higher perceptual quality and intelligibility than spectral-domain methods while requiring far fewer sampling steps for inference. The work tests the idea on standard enhancement benchmarks and reports gains in both quality metrics and computational efficiency.

Core claim

PhASE-Flow performs flow matching directly in the SSL representation domain by learning the conditional distribution of clean acoustic representations given phonetic representations, then reconstructs the enhanced waveform using a neural vocoder; experiments show this outperforms prior state-of-the-art baselines on perceptual quality and intelligibility metrics while remaining competitive even when limited to four sampling steps.
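
To make the four-step claim concrete, here is a minimal sketch of fixed-step Euler sampling for a flow-matching model. It is not the authors' implementation: the `velocity` callable, the closed-form `toy_velocity` field, and the short lists of floats standing in for SSL feature frames are all assumptions made for illustration. The phonetic condition `cond` is simply passed through to the velocity function.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature (assumption): velocity(x_t, t, cond) -> dx/dt
VelocityFn = Callable[[Sequence[float], float, Sequence[float]], Vector]


def euler_sample(velocity: VelocityFn, x0: Sequence[float],
                 cond: Sequence[float], num_steps: int = 4) -> Vector:
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # the last evaluation is at t = 1 - dt, never at t = 1
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, cond):
    """Toy field (assumption): exact velocity of a straight path toward `cond`.

    On the path x_t = (1 - t) * x0 + t * target, the velocity equals
    (target - x_t) / (1 - t); here the target is the condition itself.
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


start = [0.0, 2.0, -1.0]   # stand-in for a noise sample
target = [1.0, 0.5, 3.0]   # stand-in for a clean acoustic frame
result = euler_sample(toy_velocity, start, target, num_steps=4)
```

Because the toy field is a straight line, four Euler steps land on the target up to floating-point error. A learned velocity field is generally curved, so in practice the step count trades accuracy against inference cost.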

What carries the argument

PhASE-Flow, the phonetic-conditioned acoustic flow matching model that operates entirely inside the SSL latent space rather than the spectral domain.

If this is right

The method delivers measurable gains in perceptual quality and speech intelligibility over existing enhancement systems.
Competitive results are obtained with only four sampling steps, reducing inference cost relative to typical diffusion or flow approaches.
Direct operation inside SSL representations removes the need for explicit spectral-domain processing while still allowing waveform reconstruction via vocoder.
The phonetic conditioning step exploits the hierarchical structure already present in SSL features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning idea could be tested on other generative audio tasks that already use SSL features, such as voice conversion or source separation.
If the four-step regime holds across datasets, the approach may enable lower-latency enhancement on edge devices.
Success would suggest that many current spectral-domain generative models for audio can be replaced by latent-space versions without loss of fidelity.

Load-bearing premise

That the SSL latent space already contains cleanly separated acoustic and phonetic information so that conditioning one on the other produces a waveform free of new artifacts after vocoding.

What would settle it

A controlled listening test or objective metric comparison in which PhASE-Flow scores no higher than a strong spectral-domain flow-matching baseline or requires substantially more than four sampling steps to match its quality.

Figures

Figures reproduced from arXiv: 2606.17806 by Dahan Wang, Jing Lu, Jun Gao, Xiaobin Rong, Yu Sun.

**Figure 1.** Overview of the proposed PhASE-Flow framework.

Original abstract

Flow matching (FM) enables high-fidelity generation, while self-supervised learning (SSL) speech models provide hierarchical representations spanning acoustic and phonetic levels. However, existing FM-based speech enhancement (SE) methods operate primarily in the spectral domain, treating SSL features only as external conditions rather than modeling directly in the SSL latent space. To fully exploit the structural richness of SSL representations, we propose PhASE-Flow, an FM-based SE framework that operates entirely in the SSL space. It models the conditional distribution of clean acoustic representations given phonetic ones, reconstructing the waveform via a neural vocoder. Experiments show that PhASE-Flow outperforms state-of-the-art baselines in perceptual quality and intelligibility. Notably, it achieves competitive performance with only four sampling steps, enabling highly efficient inference. Audio demos are available at https://anonymous.4open.science/w/phase-flow_demo-E6E1/.
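
The abstract does not spell out the training objective. As a hedged illustration only, the sketch below builds one training example under the common linear-path form of conditional flow matching: a point on the straight line between a noise sample and a clean acoustic frame, with the constant velocity along that line as the regression target. The function names, the linear path, and the squared-error loss are assumptions, not details confirmed by the source.

```python
import random
from typing import List, Sequence, Tuple


def make_cfm_example(x0: Sequence[float], x1: Sequence[float],
                     t: float) -> Tuple[List[float], List[float]]:
    """Return (x_t, target velocity) on the straight path from x0 to x1."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(pred_velocity: Sequence[float],
             target_velocity: Sequence[float]) -> float:
    """Mean squared error between predicted and target velocity."""
    n = len(target_velocity)
    return sum((p - u) ** 2 for p, u in zip(pred_velocity, target_velocity)) / n


rng = random.Random(0)
clean_acoustic = [0.3, -1.2, 0.8]  # stand-in for a clean SSL acoustic frame
noise = [rng.gauss(0.0, 1.0) for _ in clean_acoustic]
t = rng.random()
x_t, u_t = make_cfm_example(noise, clean_acoustic, t)
# A real model would predict the velocity from (x_t, t, phonetic condition);
# a perfect prediction is plugged in here only to exercise the loss.
perfect_loss = cfm_loss(u_t, u_t)
```

In a phonetic-conditioned setup the model input would also include the phonetic representation; it is omitted from the data construction here because it does not change the path or the target.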

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PhASE-Flow shifts flow matching into SSL space with phonetic conditioning and claims strong results from only four steps, but the abstract supplies no numbers or ablations to evaluate those claims.

PhASE-Flow's central move is to run flow matching entirely inside SSL representations rather than in the spectral domain, conditioning the clean acoustic features on phonetic ones extracted from the same SSL model and then decoding with a neural vocoder. The efficiency angle—competitive performance at four sampling steps—is the part that stands out as potentially useful for real applications.

The framing itself is a reasonable extension of existing work. Prior FM-based enhancement methods already use SSL features as side information; pulling the generative process into the latent space and making the phonetic level explicit is a direct way to test whether the hierarchical structure in SSL models can be leveraged more deeply. That logic is clear and does not rely on circular definitions.

The main gap is empirical. The abstract states outperformance in perceptual quality and intelligibility without reporting any metrics, baselines, datasets, or ablation results. Without those details it is impossible to judge whether the SSL-to-vocoder path actually preserves the information that spectral methods retain or whether the phonetic conditioning produces a measurable gain. The risk of new artifacts from the representation-domain pipeline is plausible and needs direct testing.

This paper is aimed at researchers already working on generative speech enhancement or on better use of SSL features. Someone following flow matching or SSL-conditioned models would find the setup familiar enough to evaluate quickly once the numbers are available.

The argument is internally consistent and engages the relevant literature without obvious fitting issues. I would send it to peer review because the idea is focused and the efficiency claim is practically relevant; referees can check whether the experiments actually support the shift to SSL space.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes PhASE-Flow, a flow-matching (FM) framework for speech enhancement that operates directly in the self-supervised learning (SSL) representation domain. It models the conditional distribution of clean acoustic representations given phonetic representations inside the SSL latent space and reconstructs the waveform via a neural vocoder. The central claims are that this yields superior perceptual quality and intelligibility over state-of-the-art baselines while remaining competitive with only four sampling steps.

Significance. If the empirical results are robust, the work would be significant for showing that direct generative modeling in hierarchical SSL space (with explicit phonetic conditioning) can outperform spectral-domain FM baselines for enhancement. The reported four-step efficiency would be a practical strength for real-time applications. The provision of audio demos supports perceptual evaluation, though overall significance hinges on the strength and transparency of the quantitative evidence.

major comments (1)

[Abstract] Abstract: the claim that PhASE-Flow 'outperforms state-of-the-art baselines in perceptual quality and intelligibility' and 'achieves competitive performance with only four sampling steps' is presented without any metrics, baselines, datasets, statistical tests, or ablation results. This absence makes the central empirical claim impossible to evaluate from the supplied text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the opportunity to clarify the presentation of our results. We address the single major comment below.

Referee: [Abstract] Abstract: the claim that PhASE-Flow 'outperforms state-of-the-art baselines in perceptual quality and intelligibility' and 'achieves competitive performance with only four sampling steps' is presented without any metrics, baselines, datasets, statistical tests, or ablation results. This absence makes the central empirical claim impossible to evaluate from the supplied text.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support for the claims. The full manuscript already reports these details (PESQ, STOI, MOS, dataset names, baselines, and four-step comparisons) in Sections 4 and 5, but they are not summarized in the abstract. In the revised version we will insert a concise results sentence citing the key metrics, primary baselines, and the four-step efficiency result, while retaining the overall length constraint.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and context present PhASE-Flow as a framework that applies external flow matching techniques directly in the SSL representation domain with phonetic conditioning, followed by a neural vocoder for waveform reconstruction. No equations, derivations, or self-citations are shown that reduce any claimed prediction or result to a quantity defined by the authors' own prior work or by construction. The central claims of outperformance and efficiency rest on experimental comparisons against external baselines rather than tautological self-definitions or fitted inputs renamed as predictions. The derivation chain is therefore self-contained and draws on independent external literature for its foundational components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training procedures, or architectural details are present from which free parameters, axioms, or invented entities can be extracted.

Reference graph

Works this paper leans on

49 extracted references · 8 canonical work pages · 4 internal anchors

[1]

PhASE-Flow: Phonetic-Conditioned Acoustic Flow Matching in SSL Representation Domain for Speech Enhancement

Introduction Speech enhancement (SE) aims at recovering clean speech from noisy observations to improve perceptual quality and speech intelligibility. While conventional discriminative methods are effective at noise attenuation, they often struggle to preserve speech naturalness under challenging acoustic conditions [1]. Recently, generative methods have ...

2026
[2]

Method 2.1. Framework Overview As illustrated in Figure 1, PhASE-Flow comprises three integral modules: (1) a frozen WavLM encoder to extract acoustic and phonetic representations from noisy inputs; (2) a trainable DiT-based FM module, whose backbone is adapted from [17], to model the distribution of clean acoustic representations; and (3) a pre-trained ...

2020
[3]

Datasets The clean speech corpus comprises publicly available data from the DNS5 LibriVox subset [23], VCTK [24], EARS [25], and LibriSpeech [26]

Experiments 3.1. Datasets The clean speech corpus comprises publicly available data from the DNS5 LibriVox subset [23], VCTK [24], EARS [25], and LibriSpeech [26]. To ensure high-quality training data, we apply data filtering by retaining only samples with DNSMOS scores (OVRL, SIG, BAK, and P.808) above 3.0 and UTMOS scores above 4.0. The EARS dataset is...

2020
[4]

Our work demonstrates that this approach offers a robust and well-structured alternative to conventional spectral-domain methods

Conclusion In this paper, we introduce PhASE-Flow, an FM-based SE framework that models speech distributions directly within the SSL domain. Our work demonstrates that this approach offers a robust and well-structured alternative to conventional spectral-domain methods. Experiments show that PhASE-Flow achieves superior perceptual quality and speaker sim...
[5]

12274221) and the Yangtze River Delta Science and Technology Innovation Community Joint Research Project (Grant No

Acknowledgments This work was supported by the National Natural Science Foundation of China (Grant No. 12274221) and the Yangtze River Delta Science and Technology Innovation Community Joint Research Project (Grant No. 2024CSJGG1100)
[6]

Generative AI was employed exclusively for minor language editing and polishing to enhance clarity and readability

Generative AI Use Disclosure The authors confirm that no generative AI tools were used to create any original ideas, analyses, or substantial content in this manuscript. Generative AI was employed exclusively for minor language editing and polishing to enhance clarity and readability. The authors assume full responsibility and accountability for the int...
[7]

FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching,

Z. Wang, Z. Liu, X. Zhu, Y. Zhu, M. Liu, J. Chen, L. Xiao, C. Weng, and L. Xie, “FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching,” in Interspeech 2025, 2025, pp. 4858–4862

2025
[8]

SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

X. Li, H. Xie, Z. Wang, Z. Zhang, L. Xiao, and L. Xie, “Sense: Semantic-aware high-fidelity universal speech enhancement,” arXiv preprint arXiv:2509.24708, 2025

2025
[9]

Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders,

X. Sun, H. Dinkel, Y. Niu, L. Wang, J. Zhang, and J. Luan, “Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders,” in Interspeech 2025, 2025, pp. 4848–4852

2025
[10]

Speech enhancement and dereverberation with diffusion-based generative models,

J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023

2023
[11]

Storm: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,

J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “Storm: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023

2023
[12]

Selm: Speech enhancement using discrete tokens and language models,

Z. Wang, X. Zhu, Z. Zhang, Y. Lv, N. Jiang, G. Zhao, and L. Xie, “Selm: Speech enhancement using discrete tokens and language models,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 561–11 565

2024
[13]

Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens,

H. Yang, J. Su, M. Kim, and Z. Jin, “Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens,” in Interspeech 2024, 2024, pp. 1170–1174

2024
[14]

Pase: Leveraging the phonological prior of wavlm for low-hallucination generative speech enhancement,

X. Rong, Q. Hu, M. Yesilbursa, K. Wojcicki, and J. Lu, “Pase: Leveraging the phonological prior of wavlm for low-hallucination generative speech enhancement,” arXiv preprint arXiv:2511.13300, 2025

2025
[15]

Rethinking flow and diffusion bridge models for speech enhancement,

D. Wang, J. Gao, T. Lei, Y. Hu, C. Zhu, K. Chen, and J. Lu, “Rethinking flow and diffusion bridge models for speech enhancement,” 2026. [Online]. Available: https://arxiv.org/abs/2602.18355

2026
[16]

Generative speech foundation model pretraining for high-quality speech extraction and restoration,

P.-J. Ku, A. H. Liu, R. Korostik, S.-F. Huang, S.-W. Fu, and A. Jukić, “Generative speech foundation model pretraining for high-quality speech extraction and restoration,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025
[17]

Empirical distributions of dft-domain speech coefficients based on estimated speech variances,

T. Gerkmann and R. Martin, “Empirical distributions of dft-domain speech coefficients based on estimated speech variances,” in Proc. Int. Workshop Acoust. Echo Noise Control, 2010, pp. 1–4

2010
[18]

Layer-wise analysis of a self-supervised speech representation model,

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 914–921

2021
[19]

Investigating self-supervised learning for speech enhancement and separation,

Z. Huang, S. Watanabe, S.-w. Yang, P. García, and S. Khudanpur, “Investigating self-supervised learning for speech enhancement and separation,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6837–6841

2022
[20]

Boosting self-supervised embeddings for speech enhancement,

K.-H. Hung, S. wei Fu, H.-H. Tseng, H.-T. Chiang, Y. Tsao, and C.-W. Lin, “Boosting self-supervised embeddings for speech enhancement,” in Interspeech 2022, 2022, pp. 186–190

2022
[21]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[22]

Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4172–4182

2023
[23]

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,

Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271

2025
[24]

Back to Basics: Let Denoising Generative Models Denoise

T. Li and K. He, “Back to basics: Let denoising generative models denoise,” arXiv preprint arXiv:2511.13720, 2025

2025
[25]

Flow matching for generative modeling,

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in The Eleventh International Conference on Learning Representations,
[26]

Available: https://openreview.net/forum?id=PqvMRDCJT9t

[Online]. Available: https://openreview.net/forum?id=PqvMRDCJT9t
[27]

Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,

H. Siuzdak, “Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,” arXiv preprint arXiv:2306.00814, 2023

2023
[28]

Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,

S. Ji, Z. Jiang, W. Wang, Y. Chen, M. Fang, J. Zuo, Q. Yang, X. Cheng, Z. Wang, R. Li, Z. Zhang, X. Yang, R. Huang, Y. Jiang, Q. Chen, S. Zheng, and Z. Zhao, “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https:/...

2025
[29]

High-fidelity audio compression with improved rvqgan,

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 27 980–27 993. [Online]. Available: https://proceedings.neurips...

2023
[30]

Icassp 2023 deep noise suppression challenge,

H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, A. Ju, M. Zohourian, M. Tang, M. Golestaneh, and R. Aichner, “Icassp 2023 deep noise suppression challenge,” IEEE Open Journal of Signal Processing, vol. 5, pp. 725–737, 2024

2023
[31]

The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,

C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–4

2013
[32]

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,

J. Richter, Y.-C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, “EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,” in Interspeech 2024, 2024, pp. 4873–4877

2024
[33]

Librispeech: An asr corpus based on public domain audio books,

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

2015
[34]

WHAM!: Extending speech separation to noisy environments,

G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. L. Roux, “WHAM!: Extending speech separation to noisy environments,” in Interspeech 2019, 2019, pp. 1368–1372

2019
[35]

FSD50K: an open dataset of human-labeled sound events,

E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021

2021
[36]

FMA: A Dataset For Music Analysis

M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” arXiv preprint arXiv:1612.01840, 2016

2016
[37]

A study on data augmentation of reverberant speech for robust speech recognition,

T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 5220–5224

2017
[38]

The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,

C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,” in Interspeech 2020, 2020, pp. 2492–2496

2020
[39]

Dnsmos p.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,

C. K. A. Reddy, V. Gopal, and R. Cutler, “Dnsmos p.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 886–890

2022
[40]

UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,

T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Interspeech 2022, 2022, pp. 4521–4525

2022
[41]

SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics,

T. Saeki, S. Maiti, S. Takamichi, S. Watanabe, and H. Saruwatari, “SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics,” in Interspeech 2024, 2024, pp. 4943–4947

2024
[42]

mHuBERT-147: A Compact Multilingual HuBERT Model,

M. Zanon Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu, “mHuBERT-147: A Compact Multilingual HuBERT Model,” in Interspeech 2024, 2024, pp. 3939–3943

2024
[43]

Evaluation metrics for generative speech enhancement methods: Issues and perspectives,

J. Pirklbauer, M. Sach, K. Fluyt, W. Tirry, W. Wardah, S. Moeller, and T. Fingscheidt, “Evaluation metrics for generative speech enhancement methods: Issues and perspectives,” in Speech Communication; 15th ITG Conference, 2023, pp. 265–269

2023
[44]

Simple and Effective Zero-shot Cross-lingual Phoneme Recognition,

Q. Xu, A. Baevski, and M. Auli, “Simple and Effective Zero-shot Cross-lingual Phoneme Recognition,” in Interspeech 2022, 2022, pp. 2113–2117

2022
[45]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23– ...

2023
[46]

Tf-gridnet: Integrating full- and sub-band modeling for speech separation,

Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “Tf-gridnet: Integrating full- and sub-band modeling for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3221–3236, 2023

2023
[47]

Llase-g1: Incentivizing generalization capability for llama-based speech enhancement,

B. Kang, X. Zhu, Z. Zhang, Z. Ye, M. Liu, Z. Wang, Y. Zhu, G. Ma, J. Chen, L. Xiao et al., “Llase-g1: Incentivizing generalization capability for llama-based speech enhancement,” arXiv preprint arXiv:2503.00493, 2025

2025
[48]

Anyenhance: A unified generative model with prompt-guidance and self-critic for voice enhancement,

J. Zhang, J. Yang, Z. Fang, Y. Wang, Z. Zhang, Z. Wang, F. Fan, and Z. Wu, “Anyenhance: A unified generative model with prompt-guidance and self-critic for voice enhancement,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 3085–3098, 2025

2025
[49]

A convnet for the 2020s,

Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 966–11 976

2022

[1] [1]

PhASE-Flow: Phonetic-Conditioned Acoustic Flow Matching in SSL Representation Domain for Speech Enhancement

Introduction Speech enhancement (SE) aims at recovering clean speech from noisy observations to improve perceptual quality and speech intelligibility. While conventional discriminative methods are effective at noise attenuation, they often struggle to preserve speech naturalness under challenging acoustic conditions [1]. Recently, generative methods have ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Method 2.1. Framework Overview As illustrated in Figure 1, PhASE-Flow comprises three integral modules: (1) a frozen WavLM encoder to extract acoustic and phonetic representations from noisy inputs; (2) a trainable DiT- based FM module, whose backbone is adapted from [17], to model the distribution of clean acoustic representations; and (3) a pre-trained ...

2020

[3] [3]

Datasets The clean speech corpus comprises publicly available data from the DNS5 LibriV ox subset [23], VCTK [24], EARS [25], and LibriSpeech [26]

Experiments 3.1. Datasets The clean speech corpus comprises publicly available data from the DNS5 LibriV ox subset [23], VCTK [24], EARS [25], and LibriSpeech [26]. To ensure high-quality training data, we apply data filtering by retaining only samples with DNSMOS scores (OVRL, SIG, BAK, and P.808) above 3.0 and UTMOS scores above 4.0. The EARS dataset is...

2020

[4] [4]

Our work demonstrates that this approach offers a robust and well-structured alternative to conventional spectral- domain methods

Conclusion In this paper, we introduce PhASE-Flow, an FM-based SE framework that models speech distributions directly within the SSL domain. Our work demonstrates that this approach offers a robust and well-structured alternative to conventional spectral- domain methods. Experiments show that PhASE-Flow achieves superior perceptual quality and speaker sim...

[5] [5]

12274221) and the Yangtze River Delta Science and Technology Innovation Community Joint Re- search Project (Grant No

Acknowledgments This work was supported by the National Natural Science Foun- dation of China (Grant No. 12274221) and the Yangtze River Delta Science and Technology Innovation Community Joint Re- search Project (Grant No. 2024CSJGG1100)

[6] [6]

Generative AI was employed exclusively for minor language editing and polishing to enhance clarity and readabil- ity

Generative AI Use Disclosure The authors confirm that no generative AI tools were used to create any original ideas, analyses, or substantial content in this manuscript. Generative AI was employed exclusively for minor language editing and polishing to enhance clarity and readabil- ity. The authors assume full responsibility and accountability for the int...

[7] [7]

FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching,

Z. Wang, Z. Liu, X. Zhu, Y . Zhu, M. Liu, J. Chen, L. Xiao, C. Weng, and L. Xie, “FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching,” inInterspeech 2025, 2025, pp. 4858–4862

2025

[8] [8]

SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

X. Li, H. Xie, Z. Wang, Z. Zhang, L. Xiao, and L. Xie, “Sense: Semantic-aware high-fidelity universal speech enhance- ment,”arXiv preprint arXiv:2509.24708, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders,

X. Sun, H. Dinkel, Y . Niu, L. Wang, J. Zhang, and J. Luan, “Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders,” inInterspeech 2025, 2025, pp. 4848– 4852

2025

[10] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.

[11] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “Storm: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023.

[12] Z. Wang, X. Zhu, Z. Zhang, Y. Lv, N. Jiang, G. Zhao, and L. Xie, “Selm: Speech enhancement using discrete tokens and language models,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 561–11 565.

[13] H. Yang, J. Su, M. Kim, and Z. Jin, “Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens,” in Interspeech 2024, 2024, pp. 1170–1174.

[14] X. Rong, Q. Hu, M. Yesilbursa, K. Wojcicki, and J. Lu, “Pase: Leveraging the phonological prior of wavlm for low-hallucination generative speech enhancement,” arXiv preprint arXiv:2511.13300, 2025.

[15] D. Wang, J. Gao, T. Lei, Y. Hu, C. Zhu, K. Chen, and J. Lu, “Rethinking flow and diffusion bridge models for speech enhancement,” 2026. [Online]. Available: https://arxiv.org/abs/2602.18355

[16] P.-J. Ku, A. H. Liu, R. Korostik, S.-F. Huang, S.-W. Fu, and A. Jukić, “Generative speech foundation model pretraining for high-quality speech extraction and restoration,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[17] T. Gerkmann and R. Martin, “Empirical distributions of dft-domain speech coefficients based on estimated speech variances,” in Proc. Int. Workshop Acoust. Echo Noise Control, 2010, pp. 1–4.

[18] A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 914–921.

[19] Z. Huang, S. Watanabe, S.-w. Yang, P. García, and S. Khudanpur, “Investigating self-supervised learning for speech enhancement and separation,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6837–6841.

[20] K.-H. Hung, S.-W. Fu, H.-H. Tseng, H.-T. Chiang, Y. Tsao, and C.-W. Lin, “Boosting self-supervised embeddings for speech enhancement,” in Interspeech 2022, 2022, pp. 186–190.

[21] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.

[22] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4172–4182.

[23] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271.

[24] T. Li and K. He, “Back to basics: Let denoising generative models denoise,” arXiv preprint arXiv:2511.13720, 2025.

[25] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=PqvMRDCJT9t

[27] H. Siuzdak, “Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,” arXiv preprint arXiv:2306.00814, 2023.

[28] S. Ji, Z. Jiang, W. Wang, Y. Chen, M. Fang, J. Zuo, Q. Yang, X. Cheng, Z. Wang, R. Li, Z. Zhang, X. Yang, R. Huang, Y. Jiang, Q. Chen, S. Zheng, and Z. Zhao, “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https:/...

[29] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 27 980–27 993. [Online]. Available: https://proceedings.neurips...

[30] H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, A. Ju, M. Zohourian, M. Tang, M. Golestaneh, and R. Aichner, “Icassp 2023 deep noise suppression challenge,” IEEE Open Journal of Signal Processing, vol. 5, pp. 725–737, 2024.

[31] C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–4.

[32] J. Richter, Y.-C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, “EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,” in Interspeech 2024, 2024, pp. 4873–4877.

[33] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[34] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. L. Roux, “WHAM!: Extending speech separation to noisy environments,” in Interspeech 2019, 2019, pp. 1368–1372.

[35] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021.

[36] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” arXiv preprint arXiv:1612.01840, 2016.

[37] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224.

[38] C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,” in Interspeech 2020, 2020, pp. 2492–2496.

[39] C. K. A. Reddy, V. Gopal, and R. Cutler, “Dnsmos p.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 886–890.

[40] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Interspeech 2022, 2022, pp. 4521–4525.

[41] T. Saeki, S. Maiti, S. Takamichi, S. Watanabe, and H. Saruwatari, “SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics,” in Interspeech 2024, 2024, pp. 4943–4947.

[42] M. Zanon Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu, “mHuBERT-147: A Compact Multilingual HuBERT Model,” in Interspeech 2024, 2024, pp. 3939–3943.

[43] J. Pirklbauer, M. Sach, K. Fluyt, W. Tirry, W. Wardah, S. Moeller, and T. Fingscheidt, “Evaluation metrics for generative speech enhancement methods: Issues and perspectives,” in Speech Communication; 15th ITG Conference, 2023, pp. 265–269.

[44] Q. Xu, A. Baevski, and M. Auli, “Simple and Effective Zero-shot Cross-lingual Phoneme Recognition,” in Interspeech 2022, 2022, pp. 2113–2117.

[45] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23– ...

[46] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “Tf-gridnet: Integrating full- and sub-band modeling for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3221–3236, 2023.

[47] B. Kang, X. Zhu, Z. Zhang, Z. Ye, M. Liu, Z. Wang, Y. Zhu, G. Ma, J. Chen, L. Xiao et al., “Llase-g1: Incentivizing generalization capability for llama-based speech enhancement,” arXiv preprint arXiv:2503.00493, 2025.

[48] J. Zhang, J. Yang, Z. Fang, Y. Wang, Z. Zhang, Z. Wang, F. Fan, and Z. Wu, “Anyenhance: A unified generative model with prompt-guidance and self-critic for voice enhancement,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 3085–3098, 2025.

[49] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 966–11 976.