A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
Pith reviewed 2026-05-19 20:23 UTC · model grok-4.3
pith:RX2VMUWN
The pith
Audio super-resolution is shifting from deterministic neural mappings that over-smooth high frequencies to generative models that sample plausible missing content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors organize the literature on audio bandwidth extension and super-resolution into a taxonomy that traces the progression from discriminative deep neural network models, which perform deterministic point estimation and suffer from regression-to-the-mean effects, to a range of generative models that explicitly model the distribution of possible high-frequency content.
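The regression-to-the-mean effect named above can be illustrated with a toy sketch. It assumes, purely for illustration, that the missing high band is a unit sinusoid whose phase is unknown. The function names (`hf_candidates`, `pointwise_mean`, `energy`) and all parameter values are hypothetical and are not taken from the surveyed paper. Under squared error, the best single estimate is the pointwise average of the plausible candidates, which carries almost no energy, while any one sampled candidate keeps its full energy.

```python
import math
import random


def hf_candidates(n_candidates, n_samples, freq=0.2, seed=0):
    """Plausible high-band signals: unit sinusoids with random phase (toy assumption)."""
    rng = random.Random(seed)
    signals = []
    for _ in range(n_candidates):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        signals.append(
            [math.sin(2.0 * math.pi * freq * t + phase) for t in range(n_samples)]
        )
    return signals


def pointwise_mean(signals):
    """Average the candidates sample by sample.

    This is the single estimate that minimizes the mean squared error
    against the whole set of candidates.
    """
    count = len(signals)
    return [sum(column) / count for column in zip(*signals)]


def energy(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


candidates = hf_candidates(n_candidates=500, n_samples=200)
point_estimate = pointwise_mean(candidates)  # deterministic, squared-error optimal
one_sample = candidates[0]                   # what a sampler could return instead

print("energy of one sampled candidate:", energy(one_sample))
print("energy of the averaged estimate:", energy(point_estimate))
```

The averaged estimate is close to silence because candidates with different phases cancel, which is the over-smoothing the survey attributes to deterministic point estimation. A real system would condition on the low band and would not use random-phase sinusoids; the sketch only isolates the averaging effect.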
What carries the argument
A taxonomy of model families from early discriminative DNNs through autoregressive, VAE, GAN, diffusion, flow-based, and Schrödinger bridge approaches, together with analysis of representation domain, architecture, and conditioning mechanisms.
If this is right
- Generative models can produce varied high-frequency reconstructions instead of a single averaged result, better matching the ill-posed nature of the task.
- Choices of conditioning mechanisms and representation domains directly influence the balance between reconstruction accuracy and perceptual naturalness.
- Integration with large language models and multimodal foundation models offers pathways to leverage broader contextual information.
- Persistent challenges remain in developing reliable perceptual evaluation metrics, accurate phase modeling, and generalization beyond controlled conditions.
Where Pith is reading between the lines
- The taxonomy could help engineers pick a generative approach suited to real-time constraints on mobile devices for live audio restoration.
- The shift from deterministic to generative modeling described here may also appear in adjacent areas such as image or video resolution enhancement.
- Quantitative benchmarks comparing representative models from each category on shared datasets would make the roadmap more actionable for practitioners.
Load-bearing premise
The chosen papers and proposed taxonomy accurately reflect the main developments and trade-offs in the field without major omissions or bias.
What would settle it
Either of two results would test the survey's framing: a high-impact audio super-resolution method that fits none of the surveyed categories, or evidence that discriminative models consistently outperform generative ones on standard perceptual metrics.
Original abstract
Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey provides a comprehensive overview of the field, with a particular focus on the paradigm shift from discriminative mapping to modern generative modeling. We first review early discriminative deep neural network (DNN) models, which formulate BWE/SR as a deterministic mapping problem and are prone to regression-to-the-mean effects and spectral over-smoothing. We then systematically review generative approaches, including autoregressive (AR) models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion and score-based models, flow-based methods, and Schrödinger bridges. Across these approaches, we examine key design aspects, including representation domain, architecture, conditioning mechanisms, and trade-offs among reconstruction fidelity, perceptual quality, robustness, and computational efficiency. Furthermore, we discuss emerging directions involving large language models (LLMs) and multimodal foundation models, and highlight open challenges in perceptual evaluation, phase modeling, and real-world generalization. By providing a structured taxonomy and unified perspective, this survey establishes a comprehensive foundation and offers a practical roadmap for advancing BWE/SR from deterministic point estimation toward distribution-aware generative modeling.
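The ill-posedness described in the abstract can be made concrete with a small sketch. It assumes an idealized band-limiting step that zeroes every DFT bin above a cutoff, and it skips the resampling that a true low-resolution observation would involve. The helper names (`dft`, `idft`, `band_limit`, `tone`), the signal length, and the chosen bins are hypothetical illustration values, not details from the paper. Two full-band signals that share a low band but differ in the high band produce the same band-limited observation, so the observation alone cannot identify the original.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def band_limit(x, cutoff_bin):
    """Idealized low-pass: zero every bin above cutoff_bin and its mirror bin."""
    n = len(x)
    spectrum = dft(x)
    kept = [spectrum[k] if min(k, n - k) <= cutoff_bin else 0.0 for k in range(n)]
    return idft(kept)


def tone(n, bin_index, phase=0.0, amp=1.0):
    """Cosine that completes exactly bin_index cycles over n samples."""
    return [amp * math.cos(2.0 * math.pi * bin_index * t / n + phase) for t in range(n)]


N = 64
low_band = tone(N, 3)

# Two different high-band completions of the same low band.
full_a = [lo + hi for lo, hi in zip(low_band, tone(N, 20, phase=0.0, amp=0.5))]
full_b = [lo + hi for lo, hi in zip(low_band, tone(N, 25, phase=1.0, amp=0.3))]

obs_a = band_limit(full_a, cutoff_bin=8)
obs_b = band_limit(full_b, cutoff_bin=8)

max_obs_gap = max(abs(a - b) for a, b in zip(obs_a, obs_b))
max_full_gap = max(abs(a - b) for a, b in zip(full_a, full_b))

print("largest difference between observations:", max_obs_gap)
print("largest difference between full-band signals:", max_full_gap)
```

The two observations agree up to floating-point error while the underlying signals differ clearly, which is why a model has to choose among many consistent high-band completions rather than recover a unique one.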
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey on audio super-resolution (SR) and bandwidth extension (BWE). It reviews the progression from early discriminative DNN models, which suffer from regression-to-the-mean and over-smoothing, to generative approaches including autoregressive models, VAEs, GANs, diffusion/score-based models, flow-based methods, and Schrödinger bridges. The survey analyzes design choices across representation domain, architecture, and conditioning mechanisms, along with trade-offs in fidelity, perceptual quality, robustness, and efficiency. It covers emerging work on LLMs and multimodal models, identifies open challenges in perceptual evaluation, phase modeling, and generalization, and proposes a structured taxonomy with a unified perspective and practical roadmap for the field.
Significance. If the taxonomy accurately organizes the literature, the survey provides a timely synthesis of the shift toward distribution-aware generative modeling, which directly addresses the ill-posed nature of BWE/SR. This unified view and roadmap can help researchers navigate method selection based on explicit trade-offs and may accelerate progress by highlighting gaps such as robust real-world evaluation. The explicit contrast between deterministic point estimation and generative alternatives is a clear strength that organizes an otherwise fragmented area.
minor comments (3)
- [Introduction] The abstract and introduction claim a 'comprehensive overview' and 'structured taxonomy'; adding an explicit description of the literature search strategy, inclusion/exclusion criteria, and approximate number of papers reviewed would strengthen reader confidence in coverage without altering the central narrative.
- [Generative Approaches] In the sections reviewing generative models, quantitative comparisons (e.g., reported PESQ, STOI, or perceptual metrics across GANs, diffusion, and flow methods) are mentioned but not consolidated; a summary table would make the trade-off analysis more actionable and easier to reference.
- [Open Challenges] The discussion of open challenges in phase modeling would benefit from one or two concrete citations to recent generative works that explicitly model or bypass phase, to illustrate the practical status of the problem.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The recognition of the survey's taxonomy, unified perspective on the shift from discriminative to generative modeling, and identification of open challenges is appreciated.
Circularity Check
No circularity: survey compiles external literature without internal derivations
full rationale
This is a survey paper that reviews existing work on audio super-resolution and bandwidth extension, organizing it into a taxonomy from discriminative to generative models. The central claim is descriptive—providing a structured overview and roadmap—rather than deriving new predictions or results from equations within the paper. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; all referenced methods and results are drawn from external literature. The paper does not contain derivations, uniqueness theorems, or ansatzes that reduce to its own inputs by construction, making the work self-contained as a literature synthesis.