
arxiv: 2605.16681 · v1 · pith:RX2VMUWN · submitted 2026-05-15 · 📡 eess.AS · cs.SD · eess.SP

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

Pith reviewed 2026-05-19 20:23 UTC · model grok-4.3

classification 📡 eess.AS · cs.SD · eess.SP
keywords audio super-resolution, bandwidth extension, generative models, deep neural networks, diffusion models, generative adversarial networks, speech enhancement

The pith

Audio super-resolution is shifting from deterministic neural mappings that over-smooth high frequencies to generative models that sample plausible missing content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews methods for reconstructing high-fidelity audio from low-resolution or band-limited signals, a task made difficult by the ambiguity of the absent high-frequency details. Early approaches using discriminative deep neural networks treat the problem as a direct mapping and tend to produce averaged, overly smooth outputs. The paper then examines generative techniques including autoregressive models, variational autoencoders, generative adversarial networks, diffusion models, flow-based methods, and Schrödinger bridges. It analyzes their design choices in domains, architectures, and conditioning while weighing trade-offs in fidelity, perceptual quality, and efficiency. The survey supplies a taxonomy and roadmap to guide the move toward distribution-aware modeling.

Core claim

The authors organize the literature on audio bandwidth extension and super-resolution into a taxonomy that traces the progression from discriminative deep neural network models, which perform deterministic point estimation and suffer from regression-to-the-mean effects, to a range of generative models that explicitly model the distribution of possible high-frequency content.
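The regression-to-the-mean effect named above can be illustrated with a toy sketch. It assumes a hypothetical setting, not taken from the survey, in which one fixed low-resolution input is compatible with two equally likely high-frequency amplitudes, +1 and -1. The names `mse` and `best_point_estimate` are illustrative only.

```python
import random


def mse(estimate, targets):
    """Mean squared error of one constant estimate against all targets."""
    return sum((estimate - t) ** 2 for t in targets) / len(targets)


def best_point_estimate(targets):
    """The constant that minimizes MSE is the sample mean."""
    return sum(targets) / len(targets)


rng = random.Random(0)

# Hypothetical: plausible HF amplitudes for ONE fixed low-resolution input.
targets = [rng.choice([-1.0, 1.0]) for _ in range(10_000)]

# Deterministic (discriminative-style) answer: close to 0, matching neither mode.
point = best_point_estimate(targets)

# Generative-style answer: draw one plausible value from the distribution.
sample = rng.choice(targets)

print("point estimate:", round(point, 3), "MSE:", round(mse(point, targets), 3))
print("one sample:    ", sample, "MSE:", round(mse(sample, targets), 3))
```

In this toy case the averaged estimate has the lower MSE yet corresponds to no plausible signal, while a sampled value is plausible but scores worse on MSE. This is one way to read the fidelity versus perceptual-quality trade-off that the survey discusses.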

What carries the argument

A taxonomy of model families from early discriminative DNNs through autoregressive, VAE, GAN, diffusion, flow-based, and Schrödinger bridge approaches, together with analysis of representation domain, architecture, and conditioning mechanisms.

If this is right

  • Generative models can produce varied high-frequency reconstructions instead of a single averaged result, better matching the ill-posed nature of the task.
  • Choices of conditioning mechanisms and representation domains directly influence the balance between reconstruction accuracy and perceptual naturalness.
  • Integration with large language models and multimodal foundation models offers pathways to leverage broader contextual information.
  • Persistent challenges remain in developing reliable perceptual evaluation metrics, accurate phase modeling, and generalization beyond controlled conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could help engineers pick a generative approach suited to real-time constraints on mobile devices for live audio restoration.
  • Similar shifts from deterministic to generative modeling seen here may appear in adjacent areas such as image or video resolution enhancement.
  • Quantitative benchmarks comparing representative models from each category on shared datasets would make the roadmap more actionable for practitioners.

Load-bearing premise

The chosen papers and proposed taxonomy accurately reflect the main developments and trade-offs in the field without major omissions or bias.

What would settle it

Publication of a high-impact audio super-resolution method that cannot be placed in any of the surveyed categories or that shows discriminative models consistently outperforming generative ones on standard perceptual metrics would test the survey's framing.

Figures

Figures reproduced from arXiv: 2605.16681 by Andrew C. Singer, Diego A. Cuji, Ningyuan Yang, Pu Zhao, Ryan M. Corey, Xue Lin, Yize Li.

Figure 1: Timeline of methodological evolution in BWE and SR (2017–present). The trajectory highlights a clear recent generative tendency: after the early dominance of deterministic models, modern likelihood-based or score-based generative approaches are increasingly shaping the state-of-the-art paradigm, reflecting a generative shift from point estimation to conditional distribution matching for perceptually plausi…
Figure 2: Signal flow diagram of BWE/SR. The degradation process removes HF spectral bandwidth from a reference signal y, followed by an optional resampling stage, to produce the observation x. In practical applications, this reference signal is not available. The BWE/SR system then estimates the reconstruction ŷ from the observation x. Waveforms and spectrograms at each stage visualize the transition from the refe…
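The degradation stage in the Figure 2 caption (remove HF bandwidth, then optionally resample) can be sketched as follows. This is a minimal standard-library illustration, assuming a Hamming-windowed sinc low-pass filter and plain decimation; the survey does not prescribe a specific filter, and the function names and parameters here are hypothetical.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass taps (num_taps should be odd)."""
    fc = cutoff_hz / sample_rate  # normalized cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)  # normalize to unit gain at DC
    return [t / gain for t in taps]


def filter_same(signal, taps):
    """Zero-padded FIR filtering that keeps the input length."""
    mid = (len(taps) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(taps):
            idx = i + mid - j
            if 0 <= idx < len(signal):
                acc += h * signal[idx]
        out.append(acc)
    return out


def degrade(y, sample_rate, cutoff_hz, factor=1):
    """Reference y -> observation x: low-pass, then optional decimation."""
    x = filter_same(y, lowpass_fir(cutoff_hz, sample_rate))
    return x[::factor]


def tone(freq_hz, sample_rate, num_samples):
    """Unit-amplitude sine test signal."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(num_samples)]


def rms(values):
    """Root-mean-square level of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))
```

For example, with a 16 kHz reference and a 2 kHz cutoff, `degrade(tone(500, 16000, 800), 16000, 2000)` keeps the 500 Hz tone almost unchanged, whereas a 6 kHz tone is strongly attenuated. Passing `factor=2` also halves the number of samples.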
Figure 3: Taxonomy of BWE/SR Literature. Existing methods are organized by target sampling rates {16, 22.05, 24, 44.1, 48, 96, 192} kHz and further categorized according to their spectral mapping paradigm {fixed-constraint, multi-scenario, bandwidth-agnostic} in training settings.
… range or selected from a discrete set, as in NVSR (Liu et al., 2022a), which samples cutoff frequencies over 1–16 kHz, and AP-BWE (Lu et …
Figure 4: The U-Net architecture. It employs a symmetric encoder-decoder structure with multi-scale skip connections that align the corresponding stages, while the bottleneck block forms the most compact latent representation.
… local details while integrating long-range contextual information, whereas the aligned skip connections facilitate direct cross-resolution feature propagation, mitigating information loss cau…
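The role of the aligned skip connections in the Figure 4 caption can be shown with a deliberately simplified sketch. It assumes a 1-D "U-Net" with no learned weights: pairwise averaging for downsampling, sample repetition for upsampling, and additive skips (real U-Nets typically use learned convolutions and often concatenate skip features). The name `toy_unet` is hypothetical.

```python
def down(x):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def up(x):
    """Double the resolution by repeating every sample."""
    return [v for v in x for _ in range(2)]


def toy_unet(x, depth, use_skips=True):
    """Symmetric encoder-decoder; len(x) must be divisible by 2**depth."""
    skips = []
    for _ in range(depth):      # encoder: store features, then downsample
        skips.append(x)
        x = down(x)
    # x is now the bottleneck, the most compact representation
    for skip in reversed(skips):  # decoder: upsample, then merge the aligned skip
        x = up(x)
        if use_skips:
            x = [a + b for a, b in zip(x, skip)]
    return x


fast = [1.0, -1.0] * 8  # fastest-varying detail, destroyed by averaging
print(toy_unet(fast, depth=2, use_skips=False))  # all zeros: detail lost
print(toy_unet(fast, depth=2, use_skips=True))   # detail restored via skips
```

The alternating input is erased by the first downsampling step, so the path through the bottleneck alone cannot recover it. The skip connection carries it directly to the matching decoder stage.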
Figure 5: Visualization of a stack of dilated causal convolutional layers. Dilated causal convolutions with dilation factors of 1, 2, 4, and 8 are shown, where dilation specifies the spacing between consecutive filter taps, allowing the temporal receptive field to grow exponentially while preserving causality.
… dilation = 1, 2, 4, and 8. Each layer uses gated activation units (Van den Oord et al., 2016), which outper…
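The receptive-field growth described in the Figure 5 caption can be checked with a short sketch. It assumes a kernel size of 2 (the caption gives only the dilation factors 1, 2, 4, and 8) and omits the gated activations; the function names are illustrative.

```python
def receptive_field(kernel_size, dilations):
    """Input samples visible to one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation]; only past samples are used."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # implicit zero padding on the left keeps causality
                acc += w * x[idx]
        out.append(acc)
    return out


def dilated_stack(x, weights, dilations):
    """Apply one dilated causal layer per dilation factor, in order."""
    for d in dilations:
        x = dilated_causal_conv(x, weights, d)
    return x


dilations = [1, 2, 4, 8]
print(receptive_field(2, dilations))  # 1 + (1 + 2 + 4 + 8) = 16

impulse = [1.0] + [0.0] * 31
response = dilated_stack(impulse, [1.0, 1.0], dilations)
print(sum(1 for v in response if v != 0.0))  # the impulse spreads over 16 samples
```

Four layers with two taps each cover 16 samples, whereas four undilated layers of the same size would cover only 5. This is the exponential growth the caption refers to.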
Figure 6: Architecture of an unconditional VAE, where an encoder infers the latent distribution z parameterized by mean μ_ϕ and variance σ_ϕ, and a decoder reconstructs the input signal x via latent sampling using the reparameterization trick.
6.2 Variational Autoencoder (VAE)
An unconditional variational autoencoder (VAE) (Doersch, 2016) models the generation of an input signal x ∈ R^T through a latent variable z ∈ R…
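The latent sampling step in the Figure 6 caption can be sketched as below. The sketch assumes a diagonal Gaussian posterior whose encoder outputs are a mean and a log-variance (a common parameterization, not confirmed by the caption), and a standard normal prior for the KL term; both function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); randomness is isolated in eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [2.0, -1.0]
log_var = [math.log(0.25), 0.0]  # sigma = 0.5 and sigma = 1.0

z = reparameterize(mu, log_var, rng)
print("sampled z:", z)
print("KL term:  ", kl_to_standard_normal(mu, log_var))
```

Because the sample is a deterministic function of `mu`, `log_var`, and an independent noise draw, gradients can flow to the encoder outputs in a real training setup. That is the purpose of the reparameterization trick.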
Figure 7: Illustration of diffusion and bridge processes on audio spectrograms. (a) Diffusion: the forward process progressively corrupts the clean audio spectrogram x_0 by injecting noise, whereas the learned reverse process iteratively denoises a heavily perturbed sample x_T to recover x_0. (b) Bridge: the forward bridge process degrades the HR spectrogram x_0 into a LR spectrogram x_T through progressive bandwidth red…
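The forward (noising) half of panel (a) in Figure 7 can be sketched under one common formulation. The sketch assumes a discrete variance-preserving schedule with linearly spaced noise levels, for which x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε. The caption does not state which schedule the surveyed methods use, so the schedule values and function names below are illustrative assumptions.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): the fraction of signal power kept."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def forward_sample(x0, t, abar, rng):
    """Draw x_t given x_0 in one shot, without simulating earlier steps."""
    a = abar[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


abar = alpha_bar(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0] * 4  # stand-in for a few spectrogram bins

print("t=0:  ", forward_sample(x0, 0, abar, rng))    # almost clean
print("t=999:", forward_sample(x0, 999, abar, rng))  # close to pure noise
```

The learned reverse process is not shown. In this formulation it would be trained to undo these steps, conditioned on the low-resolution observation.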
Original abstract

Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey provides a comprehensive overview of the field, with a particular focus on the paradigm shift from discriminative mapping to modern generative modeling. We first review early discriminative deep neural network (DNN) models, which formulate BWE/SR as a deterministic mapping problem and are prone to regression-to-the-mean effects and spectral over-smoothing. We then systematically review generative approaches, including autoregressive (AR) models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion and score-based models, flow-based methods, and Schrödinger bridges. Across these approaches, we examine key design aspects, including representation domain, architecture, conditioning mechanisms, and trade-offs among reconstruction fidelity, perceptual quality, robustness, and computational efficiency. Furthermore, we discuss emerging directions involving large language models (LLMs) and multimodal foundation models, and highlight open challenges in perceptual evaluation, phase modeling, and real-world generalization. By providing a structured taxonomy and unified perspective, this survey establishes a comprehensive foundation and offers a practical roadmap for advancing BWE/SR from deterministic point estimation toward distribution-aware generative modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript is a survey on audio super-resolution (SR) and bandwidth extension (BWE). It reviews the progression from early discriminative DNN models, which suffer from regression-to-the-mean and over-smoothing, to generative approaches including autoregressive models, VAEs, GANs, diffusion/score-based models, flow-based methods, and Schrödinger bridges. The survey analyzes design choices across representation domain, architecture, and conditioning mechanisms, along with trade-offs in fidelity, perceptual quality, robustness, and efficiency. It covers emerging work on LLMs and multimodal models, identifies open challenges in perceptual evaluation, phase modeling, and generalization, and proposes a structured taxonomy with a unified perspective and practical roadmap for the field.

Significance. If the taxonomy accurately organizes the literature, the survey provides a timely synthesis of the shift toward distribution-aware generative modeling, which directly addresses the ill-posed nature of BWE/SR. This unified view and roadmap can help researchers navigate method selection based on explicit trade-offs and may accelerate progress by highlighting gaps such as robust real-world evaluation. The explicit contrast between deterministic point estimation and generative alternatives is a clear strength that organizes an otherwise fragmented area.

minor comments (3)
  1. [Introduction] The abstract and introduction claim a 'comprehensive overview' and 'structured taxonomy'; adding an explicit description of the literature search strategy, inclusion/exclusion criteria, and approximate number of papers reviewed would strengthen reader confidence in coverage without altering the central narrative.
  2. [Generative Approaches] In the sections reviewing generative models, quantitative comparisons (e.g., reported PESQ, STOI, or perceptual metrics across GANs, diffusion, and flow methods) are mentioned but not consolidated; a summary table would make the trade-off analysis more actionable and easier to reference.
  3. [Open Challenges] The discussion of open challenges in phase modeling would benefit from one or two concrete citations to recent generative works that explicitly model or bypass phase, to illustrate the practical status of the problem.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The recognition of the survey's taxonomy, unified perspective on the shift from discriminative to generative modeling, and identification of open challenges is appreciated.

Circularity Check

0 steps flagged

No circularity: survey compiles external literature without internal derivations

full rationale

This is a survey paper that reviews existing work on audio super-resolution and bandwidth extension, organizing it into a taxonomy from discriminative to generative models. The central claim is descriptive—providing a structured overview and roadmap—rather than deriving new predictions or results from equations within the paper. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; all referenced methods and results are drawn from external literature. The paper does not contain derivations, uniqueness theorems, or ansatzes that reduce to its own inputs by construction, making the work self-contained as a literature synthesis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution rests on the authors' selection and organization of prior literature in audio signal processing and generative modeling; no new free parameters, axioms, or invented entities are introduced.



Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 20 internal anchors

  1. [1]

    Wv-mos: Mos score prediction by fine-tuned wav2vec 2.0

    Pavel Andreev et al. Wv-mos: Mos score prediction by fine-tuned wav2vec 2.0. arXiv preprint arXiv:2203.13086,

  2. [2]

    An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

    Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271,

  3. [3]

    Hi-fi multi-speaker english tts dataset

    Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, and Yang Zhang. Hi-fi multi-speaker english tts dataset. arXiv preprint arXiv:2104.01497,

  4. [4]

    Frequency-domain enhanced extreme bandwidth extension network with iccrn for superior speech quality

    Hongtao Bao and Xueliang Zhang. Frequency-domain enhanced extreme bandwidth extension network with iccrn for superior speech quality. In Proc. Interspeech 2025,

  5. [5]

    Cmgan: Conformer-based metric gan for speech enhancement

    Ruizhe Cao, Sherif Abdulatif, and Bin Yang. Cmgan: Conformer-based metric gan for speech enhancement. arXiv preprint arXiv:2203.15149,

  6. [6]

    Schrodinger bridges beat diffusion models on text-to-speech synthesis

    Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, and Jun Zhu. Schrodinger bridges beat diffusion models on text-to-speech synthesis. arXiv preprint arXiv:2312.03491,

  7. [7]

    Qwen2-Audio Technical Report

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759,

  8. [8]

    The design for the wall street journal-based

    C Corpus. The design for the wall street journal-based. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, pp

  9. [9]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437,

  10. [10]

    FMA: A Dataset For Music Analysis

    Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840,

  11. [11]

    Real time speech enhancement in the waveform domain

    Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi. Real time speech enhancement in the waveform domain. arXiv preprint arXiv:2006.12847,

  12. [12]

    Tutorial on variational autoencoders

    Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908,

  13. [13]

    Adversarial Audio Synthesis

    Chris Donahue, Bo Li, and Rohit Prabhavalkar. Exploring speech enhancement with generative adversarial networks for robust speech recognition. In ICASSP. IEEE, 2018a. Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018b. Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super...

  14. [14]

    GANSynth: Adversarial Neural Audio Synthesis

    Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. Gansynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710,

  15. [15]

    Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

    Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, and Wei Ping. Audio flamingo next: Next-generation open audio-language models for...

  16. [16]

    Moss-audio-tokenizer: Scaling audio tokenizers for future audio foundation models

    Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, and Xipeng Qiu. Moss-audio-tokenizer: Scaling audio tokenizers for future audio foundation models. arXiv preprint arXiv:2602.10934,

  17. [17]

    Multi-scale sub-band constant-q transform discriminator for high-fidelity vocoder

    Yicheng Gu, Xueyao Zhang, Liumeng Xue, and Zhizheng Wu. Multi-scale sub-band constant-q transform discriminator for high-fidelity vocoder. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10616–10620. IEEE,

  18. [18]

    Nu-wave 2: A general neural audio upsampling model for various sampling rates

    Seungu Han and Junhyeok Lee. Nu-wave 2: A general neural audio upsampling model for various sampling rates. arXiv preprint arXiv:2206.08545,

  19. [19]

    Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

    Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the maestro dataset. arXiv preprint arXiv:1810.12247,

  20. [20]

    Visqol: an objective speech quality model

    Andrew Hines, Jan Skoglund, Anil C Kokaram, and Naomi Harte. Visqol: an objective speech quality model. EURASIP Journal on Audio, Speech, and Music Processing, 2015(1):13,

  21. [21]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  22. [22]

    Towards real-time generative speech restoration with flow-matching

    Tsun-An Hsieh and Sebastian Braun. Towards real-time generative speech restoration with flow-matching. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 15847–15851. IEEE,

  23. [23]

    Dccrn: Deep complex convolution recurrent network for phase-aware speech enhancement

    Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie. Dccrn: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv preprint arXiv:2008.00264,

  24. [24]

    Saga-sr: Semantically and acoustically guided audio super-resolution

    Jaekwon Im and Juhan Nam. Saga-sr: Semantically and acoustically guided audio super-resolution. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1706–1710. IEEE,

  25. [25]

    Univnet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation

    Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. Univnet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. arXiv preprint arXiv:2106.07889,

  26. [26]

    Neural Machine Translation in Linear Time

    Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099,

  27. [27]

    Bandwidth Extension on Raw Audio via Generative Adversarial Networks

    Donghyun Kim, Yungyeo Kim, and Joon-Hyuk Chang. Class: Continual learning approach for speech super-resolution. In ICASSP. IEEE, 2024a. Seung-Bin Kim, Sang-Hoon Lee, Ha-Yeong Choi, and Seong-Whan Lee. Audio super-resolution with robust speech representation learning of masked autoencoder. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:1...

  28. [28]

    Decoupling magnitude and phase estimation with deep resunet for music source separation

    Qiuqiang Kong, Yin Cao, Haohe Liu, Keunwoo Choi, and Yuxuan Wang. Decoupling magnitude and phase estimation with deep resunet for music source separation. arXiv preprint arXiv:2109.05418, 2021a. Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, et al. Token reduction should go beyond efficiency in generative models – from vision, language to multimodality. arXiv preprint ar...

  29. [29]

    Melgan: Generative adversarial networks for conditional waveform synthesis

    Kundan Kumar, Rithesh Kumar, Thibault De Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre De Brebisson, Yoshua Bengio, and Aaron C Courville. Melgan: Generative adversarial networks for conditional waveform synthesis. CVPR, 2019a. Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. V...

  30. [30]

    Fastwave: Optimized diffusion model for audio super-resolution

    Nikita Kuznetsov and Maksim Kaledin. Fastwave: Optimized diffusion model for audio super-resolution. arXiv preprint arXiv:2603.04122,

  31. [31]

    Bigvgan: A universal neural vocoder with large-scale training

    Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. Bigvgan: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658,

  32. [32]

    Semamba++: A general speech restoration framework leveraging global, local, and periodic spectral patterns

    Yongjoon Lee and Jung-Woo Choi. Semamba++: A general speech restoration framework leveraging global, local, and periodic spectral patterns. arXiv preprint arXiv:2603.11669,

  33. [33]

    Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration

    Jean-Marie Lemercier, Julius Richter, Simon Welker, and Timo Gerkmann. Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

  34. [34]

    A survey of the schrödinger problem and some of its connections with optimal transport

    Christian Léonard. A survey of the schrödinger problem and some of its connections with optimal transport. arXiv preprint arXiv:1308.0215,

  35. [35]

    Bridge-sr: Schrödinger bridge for efficient sr

    Chang Li, Zehua Chen, Fan Bao, and Jun Zhu. Bridge-sr: Schrödinger bridge for efficient sr. In ICASSP. IEEE, 2025a. Chang Li, Zehua Chen, Liyuan Wang, and Jun Zhu. Audio super-resolution with latent bridge models. arXiv preprint arXiv:2509.17609, 2025b. Changtao Li, Feiran Yang, and Jun Yang. Restoration of bone-conducted speech with u-net-like model and...

  36. [36]

    A two-stage approach to speech bandwidth extension

    Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, and Christian Fuegen. A two-stage approach to speech bandwidth extension. In Interspeech, volume 2021, pp. 1689–1693,

  37. [37]

    Swibe: A parameterized stochastic diffusion process for noise-robust bandwidth expansion

    Yin-Tse Lin, Shreya G Upadhyay, Bo-Hao Su, and Chi-Chun Lee. Swibe: A parameterized stochastic diffusion process for noise-robust bandwidth expansion. In Proc. Interspeech 2024,

  38. [38]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,

  39. [39]

    Neural vocoder is all you need for speech super-resolution

    Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, and DeLiang Wang. Neural vocoder is all you need for speech super-resolution. arXiv preprint arXiv:2203.14941, 2022a. Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, and Yuxuan Wang. Voicefixer: A unified framework for high-fidelity speech restoration. arXiv...

  40. [40]

    Audiosr: Versatile audio super-resolution at scale

    Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, and Mark D Plumbley. Audiosr: Versatile audio super-resolution at scale. In ICASSP. IEEE, 2024a. Xi Liu, Mu Yang, Szu-Jui Chen, and John HL Hansen. A neural codec approach for noise-robust bandwidth expansion. In Proc. Interspeech 2025, 2025a. Xin Liu, Shulin He, and Xueliang Zhang. Hwb-net: A novel high-performan...

  41. [41]

    Mosnet: Deep learning based objective assessment for voice conversion

    Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. Mosnet: Deep learning based objective assessment for voice conversion. arXiv preprint arXiv:1904.08352,

  42. [42]

    Towards high-quality and efficient speech bandwidth extension with parallel amplitude and phase prediction

    Ye-Xin Lu, Yang Ai, Hui-Peng Du, and Zhen-Hua Ling. Towards high-quality and efficient speech bandwidth extension with parallel amplitude and phase prediction. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024a. Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, and Zhen-Hua Ling. Multi-stage speech bandwidth extension with flexible sampling rate con...

  43. [43]

    The song describer dataset: a corpus of audio captions for music-and-language evaluation

    Ilaria Manco, Benno Weck, Seungheon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, et al. The song describer dataset: a corpus of audio captions for music-and-language evaluation. arXiv preprint arXiv:2311.10057,

  44. [44]

    Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

    Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebastian Möller. Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. arXiv preprint arXiv:2104.09494,

  45. [45]

    Chunked autoregressive gan for conditional waveform synthesis

    Max Morrison, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, and Yoshua Bengio. Chunked autoregressive gan for conditional waveform synthesis. arXiv preprint arXiv:2110.10139,

  46. [46]

    Moisesdb: A dataset for source separation beyond 4-stems

    Igor Pereira, Felipe Araújo, Filip Korzeniowski, and Richard Vogl. Moisesdb: A dataset for source separation beyond 4-stems. arXiv preprint arXiv:2307.15913,

  47. [47]

    Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

    Chandan KA Reddy, Vishak Gopal, and Ross Cutler. Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6493–6497. IEEE,

  48. [48]

    Artificial bandwidth extension using a conditional generative adversarial network with discriminative training

    Jonas Sautter, Friedrich Faubel, Markus Buck, and Gerhard Schmidt. Artificial bandwidth extension using a conditional generative adversarial network with discriminative training. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7005–7009. IEEE,

  49. [49]

    Universal score-based speech enhancement with high content preservation

    Robin Scheibler, Yusuke Fujita, Yuma Shirahata, and Tatsuya Komatsu. Universal score-based speech enhancement with high content preservation. arXiv preprint arXiv:2406.12194,

  50. [50]

    Universal speech enhancement with score-based diffusion

    Joan Serrà, Santiago Pascual, Jordi Pons, R Oguz Araz, and Davide Scaini. Universal speech enhancement with score-based diffusion. arXiv preprint arXiv:2206.03065,

  51. [51]

    mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra

    Chenhao Shuai, Chaohua Shi, Lu Gan, and Hongqing Liu. mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra. arXiv preprint arXiv:2305.11104,

  52. [52]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456,

  53. [53]

    Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

    Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-u-net: A multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185,

  54. [54]

    Nldsi-bwe: Non linear dynamical systems-inspired multi resolution discriminators for speech bandwidth extension

    Tarikul Islam Tamiti and Anomadarshi Barua. Nldsi-bwe: Non linear dynamical systems-inspired multi resolution discriminators for speech bandwidth extension. arXiv preprint arXiv:2510.01109,

  55. [55]

    A high-fidelity speech super resolution network using a complex global attention module with spectro-temporal loss

    Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan, Rashedul Hasan, Taieba Athay, Nursad Mamun, and Anomadarshi Barua. A high-fidelity speech super resolution network using a complex global attention module with spectro-temporal loss. arXiv preprint arXiv:2507.00229,

  56. [56]

    A convolutional recurrent neural network for real-time speech enhancement

    Ke Tan and DeLiang Wang. A convolutional recurrent neural network for real-time speech enhancement. In Interspeech, volume 2018, pp. 3229–3233,

  57. [57]

    Tfgan: Time and frequency domain based generative adversarial network for high-fidelity speech synthesis

    Qiao Tian, Yi Chen, Zewang Zhang, Heng Lu, Linghui Chen, Lei Xie, and Shan Liu. Tfgan: Time and frequency domain based generative adversarial network for high-fidelity speech synthesis. arXiv preprint arXiv:2011.12206,

  58. [58]

    Improving and generalizing flow-based generative models with minibatch optimal transport

    Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482,

  59. [59]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  60. [60]

    WaveNet: A Generative Model for Raw Audio

    Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, et al. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499,

  61. [61]

    Diffusion-based speech enhancement with Schrödinger bridge and symmetric noise schedule. arXiv preprint arXiv:2409.05116, 2024a

    Siyi Wang, Siyi Liu, Andrew Harper, Paul Kendrick, Mathieu Salzmann, and Milos Cernak. Diffusion-based speech enhancement with Schrödinger bridge and symmetric noise schedule. arXiv preprint arXiv:2409.05116, 2024a.

    Yingxue Wang, Shenghui Zhao, Wenbo Liu, Ming Li, and Jingming Kuang. Speech bandwidth expansion based on deep neural networks. In INTERSPEECH,

  62. [62]

    Framebridge: Improving image-to-video generation with bridge models. arXiv preprint arXiv:2410.15371, 2024b

    Yuji Wang, Zehua Chen, Xiaoyu Chen, Yixiang Wei, Jun Zhu, and Jianfei Chen. Framebridge: Improving image-to-video generation with bridge models. arXiv preprint arXiv:2410.15371, 2024b.

    Zixuan Wang, Jinghao Shi, Hanzhong Liang, Xiang Shen, Vera Wen, Zhiqian Chen, Yifan Wu, Zhixin Zhang, and Hongyu Xiong. Filter-and-refine: A mllm based cascade system for in...

  63. [63]

    Are we using enough listeners? no! an empirically-supported critique of interspeech 2014 tts evaluations

    Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter. Are we using enough listeners? no! an empirically-supported critique of interspeech 2014 tts evaluations. In Interspeech

  64. [64]

    Step-Audio 2 Technical Report

    Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. Step-audio 2 technical report. arXiv preprint arXiv:2507.16632,

  65. [65]

    The chamber ensemble generator: Limitless high-quality mir data via generative modeling. arXiv preprint arXiv:2209.14458,

    Yusong Wu, Josh Gardner, Ethan Manilow, Ian Simon, Curtis Hawthorne, and Jesse Engel. The chamber ensemble generator: Limitless high-quality mir data via generative modeling. arXiv preprint arXiv:2209.14458,

  66. [66]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  67. [67]

    Swinsrgan: Swin transformer-based generative adversarial network for high-fidelity speech super-resolution. arXiv preprint arXiv:2509.03913,

    Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, et al. Swinsrgan: Swin transformer-based generative adversarial network for high-fidelity speech super-resolution. arXiv preprint arXiv:2509.03913,

  68. [68]

    LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

    Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882,

  69. [69]

    Codecflow: Efficient bandwidth extension via conditional flow matching in neural codec latent space. arXiv preprint arXiv:2603.02022,

    Bowen Zhang, Junchuan Zhao, Ian McLoughlin, Ye Wang, and A S Madhukumar. Codecflow: Efficient bandwidth extension via conditional flow matching in neural codec latent space. arXiv preprint arXiv:2603.02022,

  70. [70]

    Wsrglow: A glow-based waveform generative model for audio super-resolution. arXiv preprint arXiv:2106.08507,

    Kexun Zhang, Yi Ren, Changliang Xu, and Zhou Zhao. Wsrglow: A glow-based waveform generative model for audio super-resolution. arXiv preprint arXiv:2106.08507,

  71. [71]

    Urgent challenge: Universality, robustness, and generalizability for speech enhancement. arXiv preprint arXiv:2406.04660,

    Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, et al. Urgent challenge: Universality, robustness, and generalizability for speech enhancement. arXiv preprint arXiv:2406.04660,

  72. [72]

    Denoising diffusion bridge models. arXiv preprint arXiv:2309.16948,

    Linqi Zhou, Aaron Lou, Samar Khanna, and Stefano Ermon. Denoising diffusion bridge models. arXiv preprint arXiv:2309.16948,