Real-Time Streamable Generative Speech Restoration with Flow Matching
Pith reviewed 2026-05-16 20:36 UTC · model grok-4.3
The pith
A flow-matching model restores speech in real time at 32 ms algorithmic latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stream.FM is a frame-causal flow-based generative model that solves speech restoration tasks in a streaming manner at 32 ms algorithmic latency and 48 ms total latency, achieving state-of-the-art quality among generative streaming methods while exhibiting only a reasonable reduction compared to its non-streaming counterpart.
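The two latency figures can be related by simple frame arithmetic. The sketch below assumes an STFT front end with a 32 ms analysis window and a 16 ms hop, and assumes that per-frame processing may take up to one hop. Neither value is stated on this page, and the function names are hypothetical.

```python
def stft_latency_ms(window_ms: float, hop_ms: float) -> dict:
    """Latency budget for a frame-based streaming system (illustrative assumptions)."""
    # Algorithmic latency: a frame cannot be processed before its full window has arrived.
    algorithmic = window_ms
    # Total latency: add the worst-case compute time, bounded by one hop in real-time operation.
    total = window_ms + hop_ms
    return {"algorithmic_ms": algorithmic, "total_ms": total}


def keeps_up(processing_ms: float, hop_ms: float) -> bool:
    """Real-time condition: each frame must be finished before the next one arrives."""
    return processing_ms <= hop_ms


# Assumed values, chosen only because they reproduce the quoted figures.
budget = stft_latency_ms(window_ms=32.0, hop_ms=16.0)
```

Under these assumed values the budget reproduces the 32 ms and 48 ms figures quoted above. The actual window and hop sizes should be taken from the paper.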
What carries the argument
Buffered streaming inference scheme applied to a frame-causal flow-matching generative model, paired with learned few-step solvers and an optimized DNN architecture.
If this is right
- Multiple speech restoration tasks can be handled by one unified streaming generative model.
- Generative speech processing becomes viable for real-time communication on current consumer hardware.
- Quality at fixed compute improves through learned few-step numerical solvers rather than more steps.
- Model compression can be used to navigate compute-quality tradeoffs in deployed streaming systems.
Where Pith is reading between the lines
- The same buffering and solver techniques could be tested on other audio signals such as music or environmental sound.
- Lower-latency variants might support applications like hearing aids or live captioning if quality holds.
- Integration with existing codecs could reduce end-to-end delay in voice-over-IP systems.
Load-bearing premise
The combination of buffered streaming inference, few-step solvers, and the chosen architecture maintains generative quality without artifacts at the target latency on consumer GPUs.
What would settle it
Objective metrics or a MUSHRA listening test showing audible artifacts or a large quality drop when the model runs at 32 ms algorithmic latency compared with offline processing.
Original abstract
Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task. Our work looks beyond theoretical latencies, showing that high-quality streaming generative speech processing can be realized on consumer GPUs available today. Stream.FM can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, Stream.FM establishes a state-of-the-art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non-streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Stream.FM, a frame-causal flow-matching generative model for real-time speech restoration. It targets multiple tasks (enhancement, dereverberation, codec post-filtering, bandwidth extension, phase retrieval, Mel vocoding) with a buffered streaming inference scheme, optimized DNN architecture, learned few-step solvers, and weight compression. The central claims are an algorithmic latency of 32 ms (48 ms total), only a reasonable quality drop relative to a non-streaming variant, and SOTA performance on generative streaming restoration that outperforms the authors' prior Diffusion Buffer work at lower latency, supported by objective metrics and a MUSHRA listening test on consumer GPUs.
Significance. If the latency figures and quality retention hold, the work would constitute a meaningful engineering contribution by demonstrating that flow-matching generative models can operate at real-time communication latencies without prohibitive quality loss, extending high-naturalness restoration beyond offline settings.
major comments (2)
- [§3.3] §3.3 (Buffered Streaming Inference): the description of how the flow-matching ODE is solved across buffered frames does not explicitly address whether the learned few-step solver preserves strict frame causality at the 32 ms algorithmic latency boundary; a concrete walk-through of the first and last frame handling would be required to substantiate the no-future-leakage claim.
- [Table 2] Table 2 (MUSHRA results): the reported mean scores for Stream.FM versus the non-streaming baseline differ by only 0.3 points on the 0-100 scale with no error bars, confidence intervals, or statistical significance test; this weakens the assertion of a 'reasonable reduction' and the SOTA claim relative to other streaming baselines.
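The statistical detail requested in the Table 2 comment could be computed along the following lines. This is a minimal sketch using only the Python standard library: it reports the paired mean difference, a paired t statistic, and a percentile bootstrap 95% confidence interval. The listener scores are invented placeholders, not data from the paper, and the function name is hypothetical.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def paired_summary(
    a: Sequence[float], b: Sequence[float], n_boot: int = 2000, seed: int = 0
) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean difference, paired t statistic, bootstrap 95% CI) for a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    # Paired t statistic with n - 1 degrees of freedom; undefined if all differences are equal.
    t_stat = mean_diff / (sd / math.sqrt(n)) if sd > 0 else float("nan")
    # Percentile bootstrap: resample listeners with replacement, recompute the mean.
    rng = random.Random(seed)
    boot = sorted(statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    ci = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1])
    return mean_diff, t_stat, ci


# Placeholder per-listener MUSHRA scores (hypothetical, for illustration only).
offline = [79, 82, 76, 80, 80, 81, 77, 84]
streaming = [78, 82, 75, 80, 79, 81, 77, 83]
mean_diff, t_stat, ci = paired_summary(offline, streaming)
```

A p-value would follow from comparing `t_stat` against a t distribution with n - 1 degrees of freedom. With few listeners, a non-parametric paired test may be the safer choice.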
minor comments (3)
- [Abstract] Abstract: the phrase 'comprehensive evaluations' should name the specific datasets (e.g., VCTK, DNS) and objective metrics (PESQ, STOI, etc.) used for each task to improve immediate readability.
- [§4.1] §4.1: the model compression results would benefit from an explicit statement of the bit-width or pruning ratio at the operating point that achieves the 48 ms total latency.
- [Figure 4] Figure 4: the latency-quality Pareto curve lacks a legend entry for the 24 ms variant mentioned in the abstract.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript.
- Referee: [§3.3] §3.3 (Buffered Streaming Inference): the description of how the flow-matching ODE is solved across buffered frames does not explicitly address whether the learned few-step solver preserves strict frame causality at the 32 ms algorithmic latency boundary; a concrete walk-through of the first and last frame handling would be required to substantiate the no-future-leakage claim.
  Authors: We agree that an explicit walk-through would improve clarity. The learned few-step solver operates on each buffered frame independently using only past and current-frame information: the initial condition for frame n is taken from the denoised output of frame n-1 (or zero for the first frame), and the ODE integration uses the current input buffer without any lookahead. In the revised manuscript we will expand §3.3 with a step-by-step description and pseudocode for the first frame (initialized from silence) and final frame (processed with the same causal buffer), confirming that the 32 ms algorithmic latency boundary is respected with no future leakage.
  Revision: yes
- Referee: [Table 2] Table 2 (MUSHRA results): the reported mean scores for Stream.FM versus the non-streaming baseline differ by only 0.3 points on the 0-100 scale with no error bars, confidence intervals, or statistical significance test; this weakens the assertion of a 'reasonable reduction' and the SOTA claim relative to other streaming baselines.
  Authors: We acknowledge that the current presentation of Table 2 lacks statistical detail. The 0.3-point difference is small and consistent with our claim of reasonable quality retention, yet we agree that error bars, confidence intervals, and a significance test would strengthen the evidence. In the revised version we will update Table 2 to report standard deviations across listeners, 95% confidence intervals, and the p-value from a paired statistical test, thereby better supporting both the quality-retention statement and the SOTA comparison.
  Revision: yes
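The frame handling described in the response to the §3.3 comment can be sketched as a causal loop. This is a toy illustration, not the paper's implementation: frames are single floats rather than STFT frames, `toy_velocity` is a hypothetical stand-in for the trained network, and plain Euler steps replace the learned few-step solver.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence


def toy_velocity(x: float, t: float, context: Sequence[float]) -> float:
    """Hypothetical stand-in for the trained model: pulls x toward the buffer mean."""
    return sum(context) / len(context) - x


def stream_restore(
    frames: Sequence[float],
    buffer_len: int = 4,
    steps: int = 2,
    velocity: Callable[[float, float, Sequence[float]], float] = toy_velocity,
) -> List[float]:
    """Process frames one at a time using only current and past input."""
    buffer: Deque[float] = deque(maxlen=buffer_len)  # never holds future frames
    prev_out = 0.0  # the first frame is initialized from silence
    outputs: List[float] = []
    h = 1.0 / steps
    for frame in frames:
        buffer.append(frame)
        x = prev_out  # initial condition taken from the previous frame's output
        for k in range(steps):
            # Euler step of the flow ODE, conditioned on the causal buffer only.
            x = x + h * velocity(x, k * h, tuple(buffer))
        outputs.append(x)
        prev_out = x
    return outputs


clean_run = stream_restore([1.0, 2.0, 3.0, 4.0, 5.0])
perturbed_run = stream_restore([1.0, 2.0, 3.0, -9.0, 7.0])
```

Comparing `clean_run` and `perturbed_run` gives a simple leakage check: changing frames 4 and 5 must leave the first three outputs untouched. The same kind of test could be run against a real model to support the no-future-leakage claim.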
Circularity Check
Minor self-citation in performance comparison; core claims rest on new experiments
specific steps
- self-citation, load-bearing [Abstract]: "outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency."
  The superiority claim is framed relative to the authors' own prior publication rather than an independent external baseline, introducing a self-referential element into the SOTA assertion even though the paper supplies new MUSHRA and objective results for the current model.
full rationale
The paper proposes a new frame-causal flow-matching architecture, buffered streaming inference, few-step solvers, and compression techniques for 32 ms latency. All technical claims are supported by the authors' own reported MUSHRA listening tests and objective metrics on the new model variants. The sole self-citation appears in the abstract's comparative statement against the authors' prior Diffusion Buffer work; this is not load-bearing for the derivation or SOTA claim, which instead relies on fresh evaluations. No equation, ansatz, or uniqueness result reduces to a prior self-citation or to a fitted input by construction. The derivation chain is therefore self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArrowOfTime.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "presents Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms... buffered streaming inference scheme and an optimized DNN architecture... learned few-step numerical solvers"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Predictive-Generative Drift Decomposition for Speech Enhancement and Separation
  SIPS decomposes stochastic interpolant dynamics into predictive drift and generative denoising to combine arbitrary pretrained predictors with a degradation-agnostic clean-speech prior for better speech enhancement an...
- Real-time Speech Restoration using Data Prediction Mean Flows
  A Data Prediction Mean Flow model enables real-time speech restoration with 120x lower compute and no algorithmic latency beyond the STFT while matching state-of-the-art offline quality.
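Learned few-step solver coefficients (Butcher tableaux $A$, $b$, $c$; Eqs. 22-27)
- Speech enhancement:
$$A = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0.458 & 0 & 0 & 0 \\ -0.847 & 1.623 & 0 & 0 \\ 2.029 & -1.707 & 0.528 & 0 \end{pmatrix},\quad b = (0.339, 0.444, 0.102, 0.114),\quad c = (0, 0.458, 0.776, 0.850) \tag{22}$$
- Dereverberation:
$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0.152 & 0 & 0 & 0 & 0 \\ -0.065 & 0.312 & 0 & 0 & 0 \\ 0.088 & 0.296 & 0.152 & 0 & 0 \\ 0.565 & 0.856 & 1.425 & -1.997 & 0 \end{pmatrix},\quad b = (0.079, 0.223, 0.423, 0.184, 0.091),\quad c = (0, 0.152, 0.247, 0.536, 0.850) \tag{23}$$
- Codec post-filtering:
$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0.298 & 0 & 0 & 0 & 0 \\ 0.049 & 0.375 & 0 & 0 & 0 \\ -0.245 & 1.030 & -0.219 & 0 & 0 \\ 0.672 & -0.168 & -0.276 & 0.622 & 0 \end{pmatrix},\quad b = (0.089, 0.211, 0.307, 0.100, 0.292),\quad c = (0, 0.298, 0.424, 0.566, 0.850) \tag{24}$$
- Bandwidth extension:
$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0.112 & 0 & 0 & 0 & 0 \\ -0.244 & 0.535 & 0 & 0 & 0 \\ -1.093 & 1.840 & -0.217 & 0 & 0 \\ -1.587 & 1.783 & 0.236 & 0.419 & 0 \end{pmatrix},\quad b = (0.085, 0.211, 0.262, 0.097, 0.344),\quad c = (0, 0.112, 0.291, 0.529, 0.850) \tag{25}$$
- STFT phase retrieval:
$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0.271 & 0 & 0 & 0 & 0 \\ 0.216 & 0.198 & 0 & 0 & 0 \\ -0.029 & 0.147 & 0.454 & 0 & 0 \\ 0.072 & 0.208 & 0.326 & 0.244 & 0 \end{pmatrix},\quad b = (0.128, 0.209, 0.307, 0.130, 0.227),\quad c = (0, 0.271, 0.413, 0.572, 0.850) \tag{26}$$
- Mel vocoding:
$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0.251 & 0 & 0 & 0 & 0 \\ 0.104 & 0.286 & 0 & 0 & 0 \\ -0.005 & 0.200 & 0.379 & 0 & 0 \\ 0.091 & 0.181 & 0.344 & 0.234 & 0 \end{pmatrix},\quad b = (0.134, 0.208, 0.307, 0.122, 0.229),\quad c = (0, 0.251, 0.390, 0.574, 0.850) \tag{27}$$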
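The coefficient sets A, b, c listed above have the shape of explicit Runge-Kutta Butcher tableaux, which is how a learned few-step solver is typically parameterized. The sketch below is an assumption-laden illustration: it applies the speech-enhancement coefficients in one generic explicit Runge-Kutta step and checks the usual consistency conditions. The helper names and the velocity callable are hypothetical, and how the paper schedules its steps is not stated on this page.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]

# Speech-enhancement coefficients as printed in Eq. (22), rounded to three decimals.
A = [
    [0.0, 0.0, 0.0, 0.0],
    [0.458, 0.0, 0.0, 0.0],
    [-0.847, 1.623, 0.0, 0.0],
    [2.029, -1.707, 0.528, 0.0],
]
B = [0.339, 0.444, 0.102, 0.114]
C = [0.0, 0.458, 0.776, 0.850]


def rk_step(
    v: Callable[[Vector, float], Vector],
    x: Vector,
    t: float,
    h: float,
    a: Sequence[Sequence[float]] = A,
    b: Sequence[float] = B,
    c: Sequence[float] = C,
) -> Vector:
    """One explicit Runge-Kutta step for dx/dt = v(x, t); one model call per stage."""
    stages: List[Vector] = []
    for i in range(len(b)):
        # Stage input uses only earlier stages because A is strictly lower triangular.
        xi = [x[d] + h * sum(a[i][j] * stages[j][d] for j in range(i)) for d in range(len(x))]
        stages.append(v(xi, t + c[i] * h))
    return [x[d] + h * sum(b[i] * stages[i][d] for i in range(len(b))) for d in range(len(x))]


def tableau_residuals(
    a: Sequence[Sequence[float]], b: Sequence[float], c: Sequence[float]
) -> Tuple[float, float]:
    """Largest |sum_j a_ij - c_i| and |sum_i b_i - 1|; both should be near zero."""
    row = max(abs(sum(a[i]) - c[i]) for i in range(len(c)))
    weight = abs(sum(b) - 1.0)
    return row, weight


row_residual, weight_residual = tableau_residuals(A, B, C)
# A constant velocity field should move x by h * sum(b).
moved = rk_step(lambda x, t: [1.0 for _ in x], [0.0], 0.0, 1.0)
```

Both residuals stay within rounding error of the three-decimal values, which supports reading the coefficients as Butcher tableaux with row sums equal to c and weights summing to one.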