pith. sign in

arxiv: 2512.19442 · v3 · submitted 2025-12-22 · 📡 eess.SP · cs.LG· cs.SD

Real-Time Streamable Generative Speech Restoration with Flow Matching

Pith reviewed 2026-05-16 20:36 UTC · model grok-4.3

classification 📡 eess.SP cs.LGcs.SD
keywords speech restorationflow matchingreal-time streaminggenerative modelsspeech enhancementlow latencydereverberation
0
0 comments X

The pith

A flow-matching model restores speech in real time at 32 ms algorithmic latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Stream.FM, a frame-causal generative model based on flow matching that performs speech restoration tasks while meeting real-time constraints. It uses a buffered streaming inference scheme and an optimized network to reach 32 ms algorithmic latency and 48 ms total latency on consumer hardware. The approach supports multiple tasks including enhancement, dereverberation, bandwidth extension, and phase retrieval. Evaluations including MUSHRA tests show it reaches state-of-the-art quality for streaming generative restoration and outperforms prior streaming diffusion work at lower latency, with only modest quality loss relative to non-streaming versions.

Core claim

Stream.FM is a frame-causal flow-based generative model that solves speech restoration tasks in a streaming manner at 32 ms algorithmic latency and 48 ms total latency, achieving state-of-the-art quality among generative streaming methods while exhibiting only a reasonable reduction compared to its non-streaming counterpart.

What carries the argument

Buffered streaming inference scheme applied to a frame-causal flow-matching generative model, paired with learned few-step solvers and an optimized DNN architecture.

If this is right

  • Multiple speech restoration tasks can be handled by one unified streaming generative model.
  • Generative speech processing becomes viable for real-time communication on current consumer hardware.
  • Quality at fixed compute improves through learned few-step numerical solvers rather than more steps.
  • Model compression can be used to navigate compute-quality tradeoffs in deployed streaming systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same buffering and solver techniques could be tested on other audio signals such as music or environmental sound.
  • Lower-latency variants might support applications like hearing aids or live captioning if quality holds.
  • Integration with existing codecs could reduce end-to-end delay in voice-over-IP systems.

Load-bearing premise

The combination of buffered streaming inference, few-step solvers, and the chosen architecture maintains generative quality without artifacts at the target latency on consumer GPUs.

What would settle it

Objective metrics or a MUSHRA listening test that shows audible artifacts or a large quality drop when the model runs at 32 ms algorithmic latency versus offline processing.

Figures

Figures reproduced from arXiv: 2512.19442 by Bunlong Lay, Maris Hillemann, Simon Welker, Tal Peer, Timo Gerkmann.

Figure 1
Figure 1. Figure 1: Inference for one new frame (orange) in a simplified frame [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Violin plots of the scores listeners assigned to examples [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Metrics of compressed Stream.FM models for Mel vocoding [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
read the original abstract

Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present Stream$.$FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task. Our work looks beyond theoretical latencies, showing that high-quality streaming generative speech processing can be realized on consumer GPUs available today. Stream$.$FM can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, Stream$.$FM establishes a state-of-the-art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non-streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Stream.FM, a frame-causal flow-matching generative model for real-time speech restoration. It targets multiple tasks (enhancement, dereverberation, codec post-filtering, bandwidth extension, phase retrieval, Mel vocoding) with a buffered streaming inference scheme, optimized DNN architecture, learned few-step solvers, and weight compression. The central claims are an algorithmic latency of 32 ms (48 ms total), only a reasonable quality drop relative to a non-streaming variant, and SOTA performance on generative streaming restoration that outperforms the authors' prior Diffusion Buffer work at lower latency, supported by objective metrics and a MUSHRA listening test on consumer GPUs.

Significance. If the latency figures and quality retention hold, the work would constitute a meaningful engineering contribution by demonstrating that flow-matching generative models can operate at real-time communication latencies without prohibitive quality loss, extending high-naturalness restoration beyond offline settings.

major comments (2)
  1. [§3.3] §3.3 (Buffered Streaming Inference): the description of how the flow-matching ODE is solved across buffered frames does not explicitly address whether the learned few-step solver preserves strict frame causality at the 32 ms algorithmic latency boundary; a concrete walk-through of the first and last frame handling would be required to substantiate the no-future-leakage claim.
  2. [Table 2] Table 2 (MUSHRA results): the reported mean scores for Stream.FM versus the non-streaming baseline differ by only 0.3 points on the 0-100 scale with no error bars, confidence intervals, or statistical significance test; this weakens the assertion of a 'reasonable reduction' and the SOTA claim relative to other streaming baselines.
minor comments (3)
  1. [Abstract] Abstract: the phrase 'comprehensive evaluations' should name the specific datasets (e.g., VCTK, DNS) and objective metrics (PESQ, STOI, etc.) used for each task to improve immediate readability.
  2. [§4.1] §4.1: the model compression results would benefit from an explicit statement of the bit-width or pruning ratio at the operating point that achieves the 48 ms total latency.
  3. [Figure 4] Figure 4: the latency-quality Pareto curve lacks a legend entry for the 24 ms variant mentioned in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (Buffered Streaming Inference): the description of how the flow-matching ODE is solved across buffered frames does not explicitly address whether the learned few-step solver preserves strict frame causality at the 32 ms algorithmic latency boundary; a concrete walk-through of the first and last frame handling would be required to substantiate the no-future-leakage claim.

    Authors: We agree that an explicit walk-through would improve clarity. The learned few-step solver operates on each buffered frame independently using only past and current-frame information: the initial condition for frame n is taken from the denoised output of frame n-1 (or zero for the first frame), and the ODE integration uses the current input buffer without any lookahead. In the revised manuscript we will expand §3.3 with a step-by-step description and pseudocode for the first frame (initialized from silence) and final frame (processed with the same causal buffer), confirming that the 32 ms algorithmic latency boundary is respected with no future leakage. revision: yes

  2. Referee: [Table 2] Table 2 (MUSHRA results): the reported mean scores for Stream.FM versus the non-streaming baseline differ by only 0.3 points on the 0-100 scale with no error bars, confidence intervals, or statistical significance test; this weakens the assertion of a 'reasonable reduction' and the SOTA claim relative to other streaming baselines.

    Authors: We acknowledge that the current presentation of Table 2 lacks statistical detail. The 0.3-point difference is small and consistent with our claim of reasonable quality retention, yet we agree that error bars, confidence intervals, and a significance test would strengthen the evidence. In the revised version we will update Table 2 to report standard deviations across listeners, 95% confidence intervals, and the p-value from a paired statistical test, thereby better supporting both the quality-retention statement and the SOTA comparison. revision: yes

Circularity Check

1 steps flagged

Minor self-citation in performance comparison; core claims rest on new experiments

specific steps
  1. self citation load bearing [Abstract]
    "outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency."

    The superiority claim is framed relative to the authors' own prior publication rather than an independent external baseline, introducing a self-referential element into the SOTA assertion even though the paper supplies new MUSHRA and objective results for the current model.

full rationale

The paper proposes a new frame-causal flow-matching architecture, buffered streaming inference, few-step solvers, and compression techniques for 32 ms latency. All technical claims are supported by the authors' own reported MUSHRA listening tests and objective metrics on the new model variants. The sole self-citation appears in the abstract's comparative statement against the authors' prior Diffusion Buffer work; this is not load-bearing for the derivation or SOTA claim, which instead relies on fresh evaluations. No equation, ansatz, or uniqueness result reduces to a prior self-citation or to a fitted input by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; latency figures and model name are presented as design outcomes rather than fitted quantities.

pith-pipeline@v0.9.0 · 5585 in / 1110 out tokens · 30196 ms · 2026-05-16T20:36:39.574397+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/ArrowOfTime.lean reality_from_one_distinction unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    presents Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms... buffered streaming inference scheme and an optimized DNN architecture... learned few-step numerical solvers

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

    eess.AS 2026-05 unverdicted novelty 6.0

    SIPS decomposes stochastic interpolant dynamics into predictive drift and generative denoising to combine arbitrary pretrained predictors with a degradation-agnostic clean-speech prior for better speech enhancement an...

  2. Real-time Speech Restoration using Data Prediction Mean Flows

    eess.AS 2026-05 unverdicted novelty 5.0

    A Data Prediction Mean Flow model enables real-time speech restoration with 120x lower compute and no algorithmic latency beyond the STFT while matching state-of-the-art offline quality.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 2 Pith papers

  1. [1]

    Speech enhancement and dereverberation with diffusion-based generative models,

    J. Richter, S. W elker, J.-M. Lemercier, B. Lay, and T . Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE Trans. on Audio, Speech, and Lang. Proc. (TASLP), vol. 31, pp. 2351–2364, 2023

  2. [2]

    Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration,

    J.-M. Lemercier, J. Richter, S. W elker, and T . Gerkmann, “ Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration, ” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2023

  3. [3]

    ScoreDec: A phase-preserving high-fidelity audio codec with a generalized score-based diffusion post-filter,

    Y .-C. Wu, D. Markovi´c, S. Krenn, I. D. Gebru, and A. Richard, “ScoreDec: A phase-preserving high-fidelity audio codec with a generalized score-based diffusion post-filter, ” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP). IEEE, 2024

  4. [4]

    FlowDec: A flow-based full-band general audio codec with high perceptual quality,

    S. W elker, M. Le, R. T . Q. Chen, W .-N. Hsu, T . Gerkmann, A. Richard, and Y .-C. Wu, “FlowDec: A flow-based full-band general audio codec with high perceptual quality, ” inInt. Conf. on Learning Repres. (ICLR), 2025

  5. [5]

    StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,

    J.-M. Lemercier, J. Richter, S. W elker, and T . Gerkmann, “StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,”IEEE Trans. on Audio, Speech, and Lang. Proc. (TASLP), vol. 31, pp. 2724–2737, 2023

  6. [6]

    DiffPhase: Generative diffusion-based STFT phase retrieval,

    T . Peer, S. W elker, and T . Gerkmann, “DiffPhase: Generative diffusion-based STFT phase retrieval,” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP). IEEE, 2023

  7. [7]

    BinauralFlow: A causal and streamable approach for high-quality binaural speech synthesis with flow matching models,

    S. Liang, D. Markovic, I. D. Gebru, S. Krenn, T . Keebler, J. Sandakly, F . Y u, S. Hassel, C. Xu, and A. Richard, “BinauralFlow: A causal and streamable approach for high-quality binaural speech synthesis with flow matching models, ” inInt. Conf. on Machine Learning (ICML), 2025

  8. [8]

    Diffusion buffer: Online diffusion- based speech enhancement with sub-second latency,

    B. Lay, R. Makarov, and T . Gerkmann, “Diffusion buffer: Online diffusion- based speech enhancement with sub-second latency, ”Interspeech, 2025

  9. [9]

    T owards real-time generative speech restoration with flow-matching,

    T .-A. Hsieh and S. Braun, “T owards real-time generative speech restoration with flow-matching, ”arXiv preprint arXiv:2510.16997, 2025

  10. [10]

    A T wo-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement,

    S. Lu, H. Huang, J. Y ao, K. W ang, Q. Hong, and L. Li, “A T wo-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement, ” inInterspeech, 2025

  11. [11]

    Flow matching for generative modeling,

    Y . Lipman, R. T . Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling, ” inInt. Conf. on Learning Repres. (ICLR), 2023

  12. [12]

    Real-time streaming Mel vocoding with generative flow matching,

    S. W elker, T . Peer, and T . Gerkmann, “Real-time streaming Mel vocoding with generative flow matching, ”arXiv preprint arXiv:2509.15085, 2025

  13. [13]

    Real-time diffusion demo for speech enhancement with 48ms latency,

    S. W elker, M. Hillemann, B. Lay, and T . Gerkmann, “Real-time diffusion demo for speech enhancement with 48ms latency, ” inDemo P apers at the ITG Conference on Speech Communication, 2025

  14. [14]

    Conditional diffusion probabilistic model for speech enhancement,

    Y .-J. Lu, Z.-Q. W ang, S. W atanabe, A. Richard, C. Y u, and Y . Tsao, “Conditional diffusion probabilistic model for speech enhancement, ” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP). IEEE, 2022

  15. [15]

    Speech enhancement with score-based generative models in the complex STFT domain,

    S. W elker, J. Richter, and T . Gerkmann, “Speech enhancement with score-based generative models in the complex STFT domain, ” inInterspeech, 2022

  16. [16]

    Score-based generative modeling through stochastic differential equations,

    Y . Song, J. Sohl-Dickstein, D. P . Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” inInt. Conf. on Learning Repres. (ICLR), 2021

  17. [17]

    Multisample flow matching: Straightening flows with minibatch couplings,

    A.-A. Pooladian, H. Ben-Hamu, C. Domingo-Enrich, B. Amos, Y . Lipman, and R. T . Q. Chen, “Multisample flow matching: Straightening flows with minibatch couplings, ” inInt. Conf. on Machine Learning (ICML), 2023

  18. [18]

    Unsupervised low latency speech enhancement with RT -GCC-NMF,

    S. U. W ood and J. Rouat, “Unsupervised low latency speech enhancement with RT -GCC-NMF,”IEEE J. Sel. T op. Signal Proc. (JSTSP), vol. 13, no. 2, pp. 332–346, 2019

  19. [19]

    STFT-domain neural speech enhancement with very low algorithmic latency,

    Z.-Q. W ang, G. Wichern, S. W atanabe, and J. Le Roux, “STFT-domain neural speech enhancement with very low algorithmic latency, ”IEEE Trans. on Audio, Speech, and Lang. Proc. (TASLP), vol. 31, pp. 397–410, 2023

  20. [20]

    Hairer, S

    E. Hairer, S. P . Norsett, and G. W anner,Solving ordinary differential equations I, 2nd ed., ser. Springer Series in Computational Mathematics. Springer, 1993

  21. [21]

    On Runge-Kutta processes of high order,

    J. C. Butcher, “On Runge-Kutta processes of high order,”Journal of the Australian Mathematical Society, vol. 4, no. 2, p. 179–194, 1964

  22. [22]

    Real time speech enhancement in the waveform domain,

    A. D´efossez, G. Synnaeve, and Y . Adi, “Real time speech enhancement in the waveform domain, ” inInterspeech, 2020

  23. [23]

    Optimized Real-time Speech Enhancement with Deep SSMs on Raw Audio,

    Y . R. Pei, R. Shrivastava, and F . Sidharth, “Optimized Real-time Speech Enhancement with Deep SSMs on Raw Audio, ” inInterspeech, 2025

  24. [24]

    HiFi-Stream: Streaming speech enhancement with generative adversarial networks,

    E. Dmitrieva and M. Kaledin, “HiFi-Stream: Streaming speech enhancement with generative adversarial networks,”IEEE Signal Proc. Lett. (SPL), vol. 32, pp. 3595–3599, 2025

  25. [25]

    Causal diffusion models for generalized speech enhancement,

    J. Richter, S. W elker, J.-M. Lemercier, B. Lay, T . Peer, and T . Gerkmann, “Causal diffusion models for generalized speech enhancement,”IEEE Open J. Signal Proc., 2024

  26. [26]

    Continual inference: a library for efficient online inference with deep neural networks in PyT orch,

    L. Hedegaard and A. Iosifidis, “Continual inference: a library for efficient online inference with deep neural networks in PyT orch, ” inECCV W orkshops, 2022

  27. [27]

    Diffusion buffer for online generative speech enhancement,

    B. Lay, R. Makarov, S. W elker, M. Hillemann, and T . Gerkmann, “Diffusion buffer for online generative speech enhancement,”arXiv preprint arXiv:2510.18744, 2025

  28. [28]

    Group normalization,

    Y . Wu and K. He, “Group normalization, ” inEur . Conf. Comput. V is., 2018

  29. [29]

    Subspectral normalization for neural audio data processing,

    S. Chang, H. Park, J. Cho, H. Park, S. Y un, and K. Hwang, “Subspectral normalization for neural audio data processing,” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP). IEEE, 2021. 11

  30. [30]

    Parallel W aveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,

    R. Y amamoto, E. Song, and J.-M. Kim, “Parallel W aveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020

  31. [31]

    SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging NLP evaluation metrics,

    T . Saeki, S. Maiti, S. T akamichi, S. W atanabe, and H. Saruwatari, “SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging NLP evaluation metrics, ” inInterspeech, 2024

  32. [32]

    Are these even words? quantifying the gibberishness of generative speech models,

    D. de Oliveira, T . Peer, J. Rochdi, and T . Gerkmann, “ Are these even words? quantifying the gibberishness of generative speech models,”arXiv preprint arXiv:2510.21317, 2025

  33. [33]

    Runge-kutta methods with minimum error bounds,

    A. Ralston, “Runge-kutta methods with minimum error bounds, ”Mathematics of computation, vol. 16, no. 80, pp. 431–437, 1962

  34. [34]

    Network decoupling: From regular to depthwise separable convolutions,

    J. Guo, Y . Li, W . Lin, Y . Chen, and J. Li, “Network decoupling: From regular to depthwise separable convolutions, ” inBritish Machine V ision Conference, 2018

  35. [35]

    EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation,

    J. Richter, Y .-C. Wu, S. Krenn, S. W elker, B. Lay, S. W atanabe, A. Richard, and T . Gerkmann, “EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation, ” inInterspeech, 2024

  36. [36]

    High-fidelity audio compression with improved R VQGAN,

    R. Kumar, P . Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved R VQGAN,” inAdvances in Neural Inf. Proc. Systems (NeurIPS), 2023

  37. [37]

    A flexible online framework for projection-based STFT phase retrieval,

    T . Peer, S. W elker, J. Kolhoff, and T . Gerkmann, “ A flexible online framework for projection-based STFT phase retrieval,” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP). IEEE, 2024

  38. [38]

    HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,

    J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis, ”Advances in Neural Inf. Proc. Systems (NeurIPS), 2020

  39. [39]

    Open-source conversational AI with SpeechBrain 1.0,

    M. Ravanelli, T . Parcollet, A. Moumen, S. de Langen, C. Subakan, P . Plantinga, Y . W ang, P . Mousavi, L. D. Libera, A. Ploujnikovet al., “Open-source conversational AI with SpeechBrain 1.0,”J. of Machine Learning Research, vol. 25, no. 333, 2024

  40. [40]

    SOAP: Improving and stabilizing shampoo using adam for language modeling,

    N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. M. Kakade, “SOAP: Improving and stabilizing shampoo using adam for language modeling, ” inInt. Conf. on Learning Repres. (ICLR), 2025

  41. [41]

    Optimization benchmark for diffusion models on dynamical systems,

    F . Schaipp, “Optimization benchmark for diffusion models on dynamical systems, ” inEurIPS W orkshop on Principles of Generative Modeling, 2025

  42. [42]

    DeepFilterNet: Perceptually motivated real-time speech enhancement,

    H. Schr¨oter, T . Rosenkranz, A. N. Escalante-B., and A. Maier, “DeepFilterNet: Perceptually motivated real-time speech enhancement, ” inInterspeech, 2023

  43. [43]

    Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,

    A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, ” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2001

  44. [44]

    An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,

    J. Jensen and C. H. T aal, “ An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,”IEEE Trans. on Audio, Speech, and Lang. Proc. (TASLP), vol. 24, no. 11, pp. 2009–2022, 2016

  45. [45]

    SDR - half-baked or well done?

    J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR - half-baked or well done?” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2019

  46. [46]

    Neural vocoder is all you need for speech super-resolution,

    H. Liu, W . Choi, X. Liu, Q. Kong, Q. Tian, and D. W ang, “Neural vocoder is all you need for speech super-resolution, ” inInterspeech, 2022

  47. [47]

    QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions,

    S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V . Lavrukhin, R. Leary, J. Li, and Y . Zhang, “QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions, ” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2020

  48. [48]

    arXiv preprint arXiv:1909.09577 , year=

    O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V . Lavrukhin, J. Cooket al., “NeMo: a toolkit for building AI applications using neural modules, ”arXiv preprint arXiv:1909.09577, 2019

  49. [49]

    NISQA: A deep CNN- self-attention model for multidimensional speech quality prediction with crowdsourced datasets,

    G. Mittag, B. Naderi, A. Chehadi, and S. M ¨oller, “NISQA: A deep CNN- self-attention model for multidimensional speech quality prediction with crowdsourced datasets, ” inInterspeech, 2021

  50. [50]

    HiFi++: A unified framework for bandwidth extension and speech enhancement,

    P . Andreev, A. Alanov, O. Ivanov, and D. V etrov, “HiFi++: A unified framework for bandwidth extension and speech enhancement,” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2023

  51. [51]

    Distillation and pruning for scalable self-supervised representation-based speech quality assessment,

    B. Stahl and H. Gamper, “Distillation and pruning for scalable self-supervised representation-based speech quality assessment,” inIEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2025

  52. [52]

    Method for the subjective assessment of intermediate quality level of audio systems,

    ITU-R Rec. BS.1534-3, “Method for the subjective assessment of intermediate quality level of audio systems, ”Int. T elecom. Union (ITU), 2014

  53. [53]

    Investigating RNN-based speech enhancement methods for noise-robust text-to-speech,

    C. V alentini-Botinhao, X. W ang, S. T akaki, and J. Y amagishi, “Investigating RNN-based speech enhancement methods for noise-robust text-to-speech, ” in 9th ISCA W orkshop on Speech Synthesis W orkshop (SSW 9), 2016

  54. [54]

    Phase retrieval by iterated projections,

    V . Elser, “Phase retrieval by iterated projections,”J. Opt. Soc. Am. A, vol. 20, no. 1, p. 40, 2003

  55. [55]

    An efficient algorithm for real-time spectrogram inversion,

    G. T . Beauregard, X. Zhu, and L. W yse, “ An efficient algorithm for real-time spectrogram inversion, ” inInt. Conf. on Digital Audio Effects, 2005

  56. [56]

    LibriTTS: A corpus derived from LibriSpeech for text-to-speech,

    H. Zen, V . Dang, R. Clark, Y . Zhang, R. J. W eiss, Y . Jia, Z. Chen, and Y . Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” in Interspeech, 2019. Simon Welkerreceived a B.Sc. in Computing in Science (2019) and M.Sc. in Bioinformatics (2021) from Univer- sity of Hamburg, Germany. He is currently a PhD student in the labs of Prof. T...

  57. [57]

    Speech enhancement: A=   0 0 0 0 0.458 0 0 0 −0.847 1.623 0 0 2.029−1.707 0.528 0   b= 0.339,0.444,0.102,0.114 c= 0,0.458,0.776,0.850 (22)

  58. [58]

    Dereverberation: A=   0 0 0 0 0 0.152 0 0 0 0 −0.065 0.312 0 0 0 0.088 0.296 0.152 0 0 0.565 0.856 1.425−1.997 0   b= 0.079,0.223,0.423,0.184,0.091 c= 0,0.152,0.247,0.536,0.850 (23)

  59. [59]

    Codec post-filtering: A=   0 0 0 0 0 0.298 0 0 0 0 0.049 0.375 0 0 0 −0.245 1.030−0.219 0 0 0.672−0.168−0.276 0.622 0   b= 0.089,0.211,0.307,0.100,0.292 c= 0,0.298,0.424,0.566,0.850 (24)

  60. [60]

    Bandwidth extension: A=   0 0 0 0 0 0.112 0 0 0 0 −0.244 0.535 0 0 0 −1.093 1.840−0.217 0 0 −1.587 1.783 0.236 0.419 0   b= 0.085,0.211,0.262,0.097,0.344 c= 0,0.112,0.291,0.529,0.850 (25)

  61. [61]

    STFT phase retrieval: A=   0 0 0 0 0 0.271 0 0 0 0 0.216 0.198 0 0 0 −0.029 0.147 0.454 0 0 0.072 0.208 0.326 0.244 0   b= 0.128,0.209,0.307,0.130,0.227 c= 0,0.271,0.413,0.572,0.850 (26)

  62. [62]

    Mel vocoding: A=   0 0 0 0 0 0.251 0 0 0 0 0.104 0.286 0 0 0 −0.005 0.200 0.379 0 0 0.091 0.181 0.344 0.234 0   b= 0.134,0.208,0.307,0.122,0.229 c= 0,0.251,0.390,0.574,0.850 (27)