A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

Andrew C. Singer; Diego A. Cuji; Ningyuan Yang; Pu Zhao; Ryan M. Corey; Xue Lin; Yize Li

arXiv: 2605.16681 · v1 · pith:RX2VMUWNnew · submitted 2026-05-15 · 📡 eess.AS · cs.SD · eess.SP

Ningyuan Yang, Yize Li, Diego A. Cuji, Ryan M. Corey, Pu Zhao, Xue Lin, Andrew C. Singer

Pith reviewed 2026-05-19 20:23 UTC · model grok-4.3

classification 📡 eess.AS · cs.SD · eess.SP

keywords: audio super-resolution, bandwidth extension, generative models, deep neural networks, diffusion models, generative adversarial networks, speech enhancement

The pith

Audio super-resolution is shifting from deterministic neural mappings that over-smooth high frequencies to generative models that sample plausible missing content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews methods for reconstructing high-fidelity audio from low-resolution or band-limited signals, a task made difficult by the ambiguity of the absent high-frequency details. Early approaches using discriminative deep neural networks treat the problem as a direct mapping and tend to produce averaged, overly smooth outputs. The paper then examines generative techniques including autoregressive models, variational autoencoders, generative adversarial networks, diffusion models, flow-based methods, and Schrödinger bridges. It analyzes their design choices in domains, architectures, and conditioning while weighing trade-offs in fidelity, perceptual quality, and efficiency. The survey supplies a taxonomy and roadmap to guide the move toward distribution-aware modeling.

Core claim

The authors organize the literature on audio bandwidth extension and super-resolution into a taxonomy that traces the progression from discriminative deep neural network models, which perform deterministic point estimation and suffer from regression-to-the-mean effects, to a range of generative models that explicitly model the distribution of possible high-frequency content.
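
The regression-to-the-mean effect mentioned above can be illustrated with a toy sketch. It is not taken from the paper: it assumes a single low-resolution input is consistent with two equally likely high-frequency values, -1 and +1 (hypothetical numbers), and compares the MSE-optimal point estimate with one draw from the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: two equally plausible HF values for the same input.
plausible_targets = np.array([-1.0, 1.0])
samples = rng.choice(plausible_targets, size=10_000)

# Evaluate the mean squared error of every candidate point estimate on a grid.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((samples[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best_point_estimate = candidates[np.argmin(mse)]

# A generative model would instead draw from the conditional distribution.
one_generated_sample = rng.choice(plausible_targets)

print(f"MSE-optimal point estimate: {best_point_estimate:.2f}")
print(f"One generative sample:      {one_generated_sample:+.0f}")
```

Under these assumptions the point estimate lands near zero, a value that neither plausible answer takes, while the sampled value is always one of the two modes. This is the intuition behind the over-smoothing the survey attributes to deterministic mappings.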

What carries the argument

A taxonomy of model families from early discriminative DNNs through autoregressive, VAE, GAN, diffusion, flow-based, and Schrödinger bridge approaches, together with analysis of representation domain, architecture, and conditioning mechanisms.

If this is right

Generative models can produce varied high-frequency reconstructions instead of a single averaged result, better matching the ill-posed nature of the task.
Choices of conditioning mechanisms and representation domains directly influence the balance between reconstruction accuracy and perceptual naturalness.
Integration with large language models and multimodal foundation models offers pathways to leverage broader contextual information.
Persistent challenges remain in developing reliable perceptual evaluation metrics, accurate phase modeling, and generalization beyond controlled conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The taxonomy could help engineers pick a generative approach suited to real-time constraints on mobile devices for live audio restoration.
Similar shifts from deterministic to generative modeling seen here may appear in adjacent areas such as image or video resolution enhancement.
Quantitative benchmarks comparing representative models from each category on shared datasets would make the roadmap more actionable for practitioners.

Load-bearing premise

The chosen papers and proposed taxonomy accurately reflect the main developments and trade-offs in the field without major omissions or bias.

What would settle it

Publication of a high-impact audio super-resolution method that cannot be placed in any of the surveyed categories or that shows discriminative models consistently outperforming generative ones on standard perceptual metrics would test the survey's framing.

Figures

Figures reproduced from arXiv: 2605.16681 by Andrew C. Singer, Diego A. Cuji, Ningyuan Yang, Pu Zhao, Ryan M. Corey, Xue Lin, Yize Li.

**Figure 1.** Timeline of methodological evolution in BWE and SR (2017–present). The trajectory highlights a clear recent generative tendency: after the early dominance of deterministic models, modern likelihood-based or score-based generative approaches are increasingly shaping the state-of-the-art paradigm, reflecting a generative shift from point estimation to conditional distribution matching for perceptually plausi…

**Figure 2.** Signal flow diagram of BWE/SR. The degradation process removes HF spectral bandwidth from a reference signal y, followed by an optional resampling stage, to produce the observation x. In practical applications, this reference signal is not available. The BWE/SR system then estimates the reconstruction ŷ from the observation x. Waveforms and spectrograms at each stage visualize the transition from the refe…
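
The degradation stage in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the function name `degrade`, the brick-wall FFT low-pass, the integer decimation, and the two test tones are all hypothetical choices made for clarity.

```python
import numpy as np

def degrade(y, sample_rate, cutoff_hz, decimate_by=1):
    """Simulate the observation x from a reference y (illustrative only)."""
    spectrum = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0            # remove HF spectral content
    band_limited = np.fft.irfft(spectrum, n=len(y))
    # Optional resampling: keeping every k-th sample is alias-free only if
    # cutoff_hz <= sample_rate / (2 * decimate_by).
    return band_limited[::decimate_by]

sample_rate = 48_000
t = np.arange(sample_rate) / sample_rate         # one second of audio
# Reference y: a 1 kHz tone plus a weaker 10 kHz tone (assumed test signal).
y = np.sin(2 * np.pi * 1_000 * t) + 0.5 * np.sin(2 * np.pi * 10_000 * t)

# Observation x: band-limited to 4 kHz, then resampled from 48 kHz to 16 kHz.
x = degrade(y, sample_rate, cutoff_hz=4_000, decimate_by=3)
print(len(y), len(x))
```

In this toy case the 10 kHz component is gone from `x`, and nothing in `x` says how strong it was. Recovering it is the ill-posed part that a BWE/SR system has to handle.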

**Figure 3.** Taxonomy of BWE/SR literature. Existing methods are organized by target sampling rates {16, 22.05, 24, 44.1, 48, 96, 192} kHz and further categorized according to their spectral mapping paradigm {fixed-constraint, multi-scenario, bandwidth-agnostic} in training settings. … range or selected from a discrete set, as in NVSR (Liu et al., 2022a), which samples cutoff frequencies over 1–16 kHz, and AP-BWE (Lu et …

**Figure 4.** The U-Net architecture. It employs a symmetric encoder-decoder structure with multi-scale skip connections that align the corresponding stages, while the bottleneck block forms the most compact latent representation. … local details while integrating long-range contextual information, whereas the aligned skip connections facilitate direct cross-resolution feature propagation, mitigating information loss cau…

**Figure 5.** Visualization of a stack of dilated causal convolutional layers. Dilated causal convolutions with dilation factors of 1, 2, 4, and 8 are shown, where dilation specifies the spacing between consecutive filter taps, allowing the temporal receptive field to grow exponentially while preserving causality. … dilation = 1, 2, 4, and 8. Each layer uses gated activation units (Van den Oord et al., 2016), which outper…
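
The receptive-field growth described in the Figure 5 caption can be checked with a small sketch. The kernel size of 2 and the all-ones filter taps are assumptions made for illustration, and the helper names are hypothetical.

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Samples (current plus past) seen by a stack of dilated causal convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], zeros before t = 0."""
    y = np.zeros(len(x), dtype=float)
    for k, w in enumerate(weights):
        shift = k * dilation
        y[shift:] += w * x[: len(x) - shift]
    return y

dilations = [1, 2, 4, 8]
print(receptive_field(kernel_size=2, dilations=dilations))  # 16

# Push an impulse through the stack: it spreads over exactly the receptive
# field, and only forward in time, so causality is preserved.
signal = np.zeros(32)
signal[0] = 1.0
for d in dilations:
    signal = causal_dilated_conv(signal, weights=[1.0, 1.0], dilation=d)
print(int(np.count_nonzero(signal)))  # 16
```

With these four layers the receptive field is 16 samples, double what three layers (dilations 1, 2, 4) would give. Each added layer with a doubled dilation doubles the span again.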

**Figure 6.** Architecture of an unconditional VAE, where an encoder infers the latent distribution z parameterized by mean μ_ϕ and variance σ_ϕ, and a decoder reconstructs the input signal x via latent sampling using the reparameterization trick. … 6.2 Variational Autoencoder (VAE). An unconditional variational autoencoder (VAE) (Doersch, 2016) models the generation of an input signal x ∈ ℝ^T through a latent variable z ∈ ℝ…
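
The reparameterization trick named in the Figure 6 caption can be sketched in a few lines. This is a generic illustration, not the paper's code: the encoder outputs are replaced by fixed hypothetical values, and the log-variance parameterization is a common convention assumed here.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)
    # The randomness lives in eps, so z is a deterministic function of
    # (mu, log_var); this is what lets gradients reach the encoder.
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])                    # hypothetical encoder mean
log_var = np.log(np.array([0.04, 0.25]))      # hypothetical encoder log-variance

z = np.stack([reparameterize(mu, log_var, rng) for _ in range(20_000)])
print(z.mean(axis=0), z.var(axis=0))
```

The empirical mean and variance of the draws should sit close to the values the encoder supplied, which confirms that the shifted and scaled noise follows the intended Gaussian.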

**Figure 7.** Illustration of diffusion and bridge processes on audio spectrograms. (a) Diffusion: the forward process progressively corrupts the clean audio spectrogram x_0 by injecting noise, whereas the learned reverse process iteratively denoises a heavily perturbed sample x_T to recover x_0. (b) Bridge: the forward bridge process degrades the HR spectrogram x_0 into a LR spectrogram x_T through progressive bandwidth red…
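
The forward (noising) half of panel (a) in Figure 7 can be sketched as below. The caption does not specify a noise schedule, so this assumes a DDPM-style variance-preserving process with a linear schedule; the array shape, step count, and schedule endpoints are illustrative choices. The bridge process of panel (b) is not covered.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)     # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)            # cumulative signal retention

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 128))            # stand-in for a clean spectrogram
x_early = forward_diffuse(x0, 10, alpha_bar, rng)
x_late = forward_diffuse(x0, num_steps - 1, alpha_bar, rng)

def corr(a, b):
    """Pearson correlation between two arrays, flattened."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

print(corr(x0, x_early), corr(x0, x_late))
```

Early steps stay almost perfectly correlated with the clean input, while the final step is close to pure noise. The learned reverse process has to undo exactly this loss of information.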

Original abstract

Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey provides a comprehensive overview of the field, with a particular focus on the paradigm shift from discriminative mapping to modern generative modeling. We first review early discriminative deep neural network (DNN) models, which formulate BWE/SR as a deterministic mapping problem and are prone to regression-to-the-mean effects and spectral over-smoothing. We then systematically review generative approaches, including autoregressive (AR) models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion and score-based models, flow-based methods, and Schrödinger bridges. Across these approaches, we examine key design aspects, including representation domain, architecture, conditioning mechanisms, and trade-offs among reconstruction fidelity, perceptual quality, robustness, and computational efficiency. Furthermore, we discuss emerging directions involving large language models (LLMs) and multimodal foundation models, and highlight open challenges in perceptual evaluation, phase modeling, and real-world generalization. By providing a structured taxonomy and unified perspective, this survey establishes a comprehensive foundation and offers a practical roadmap for advancing BWE/SR from deterministic point estimation toward distribution-aware generative modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a competent survey that organizes the shift from discriminative to generative methods in audio super-resolution but adds no new results or experiments.

Hi colleague,

The punchline on this arXiv paper is that it's a literature survey on audio super-resolution, also called bandwidth extension, that charts the field's move from discriminative deep networks to generative models. It synthesizes existing approaches rather than presenting fresh experiments or derivations.

What stands out positively is the structured review. The authors outline early discriminative DNNs and their issues with over-smoothing and regression to the mean. They then systematically go through generative techniques: autoregressive models, VAEs, GANs, diffusion and score-based models, flow-based methods, and Schrödinger bridges. Across these, they discuss practical design elements such as the representation domain, network architectures, conditioning strategies on the low-resolution input, and the trade-offs involving reconstruction accuracy, how natural the output sounds, robustness to variations, and how much compute is needed. The coverage of emerging topics like integration with large language models and multimodal models, along with persistent issues in perceptual assessment, phase reconstruction, and performance outside controlled settings, gives a forward-looking angle.

The weaker aspects are those common to survey papers. The taxonomy and perspective depend on which works the authors chose to include, so there could be gaps or a particular slant even if the abstract shows a logical flow. Since this is not an empirical study, there are no new results to verify or code to run. The claim of offering a comprehensive foundation and roadmap is reasonable for organizing the literature but doesn't resolve the core ill-posed nature of the task.

This kind of paper is for audio processing specialists or students who need a map of the current landscape to choose methods or spot open problems. It would appeal to someone working on generative audio or signal enhancement who wants context without reading dozens of papers separately.

Overall, I think it merits peer review. Useful surveys help the field even when they don't push the technical frontier.

Referee Report

0 major / 3 minor

Summary. The manuscript is a survey on audio super-resolution (SR) and bandwidth extension (BWE). It reviews the progression from early discriminative DNN models, which suffer from regression-to-the-mean and over-smoothing, to generative approaches including autoregressive models, VAEs, GANs, diffusion/score-based models, flow-based methods, and Schrödinger bridges. The survey analyzes design choices across representation domain, architecture, and conditioning mechanisms, along with trade-offs in fidelity, perceptual quality, robustness, and efficiency. It covers emerging work on LLMs and multimodal models, identifies open challenges in perceptual evaluation, phase modeling, and generalization, and proposes a structured taxonomy with a unified perspective and practical roadmap for the field.

Significance. If the taxonomy accurately organizes the literature, the survey provides a timely synthesis of the shift toward distribution-aware generative modeling, which directly addresses the ill-posed nature of BWE/SR. This unified view and roadmap can help researchers navigate method selection based on explicit trade-offs and may accelerate progress by highlighting gaps such as robust real-world evaluation. The explicit contrast between deterministic point estimation and generative alternatives is a clear strength that organizes an otherwise fragmented area.

minor comments (3)

[Introduction] The abstract and introduction claim a 'comprehensive overview' and 'structured taxonomy'; adding an explicit description of the literature search strategy, inclusion/exclusion criteria, and approximate number of papers reviewed would strengthen reader confidence in coverage without altering the central narrative.
[Generative Approaches] In the sections reviewing generative models, quantitative comparisons (e.g., reported PESQ, STOI, or perceptual metrics across GANs, diffusion, and flow methods) are mentioned but not consolidated; a summary table would make the trade-off analysis more actionable and easier to reference.
[Open Challenges] The discussion of open challenges in phase modeling would benefit from one or two concrete citations to recent generative works that explicitly model or bypass phase, to illustrate the practical status of the problem.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The recognition of the survey's taxonomy, unified perspective on the shift from discriminative to generative modeling, and identification of open challenges is appreciated.

Circularity Check

0 steps flagged

No circularity: survey compiles external literature without internal derivations

This is a survey paper that reviews existing work on audio super-resolution and bandwidth extension, organizing it into a taxonomy from discriminative to generative models. The central claim is descriptive—providing a structured overview and roadmap—rather than deriving new predictions or results from equations within the paper. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; all referenced methods and results are drawn from external literature. The paper does not contain derivations, uniqueness theorems, or ansatzes that reduce to its own inputs by construction, making the work self-contained as a literature synthesis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution rests on the authors' selection and organization of prior literature in audio signal processing and generative modeling; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5805 in / 1114 out tokens · 45507 ms · 2026-05-19T20:23:18.122031+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ScAle: Attention Head Scaling as a Minimal Adapter for Spatial Reasoning in Vision Language Models
cs.CV · 2026-06 · unverdicted · novelty 6.0
ScAle learns scalar coefficients to modulate last-token attention and MLP activations in frozen VLMs, achieving up to 134.1% relative accuracy gains on spatial benchmarks with only 1K parameters.

FSD50K-Solo: Automated Curation of Single-Source Sound Events
eess.AS · 2026-05 · conditional · novelty 6.0
The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned F...

FSD50K-Solo: Automated Curation of Single-Source Sound Events
eess.AS · 2026-05 · unverdicted · novelty 6.0
A curation pipeline combining diffusion-based synthetic mixtures with a discriminative classifier produces and releases FSD50K-Solo, a single-source subset of FSD50K that matches human expert labels on a test set.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 2 Pith papers · 21 internal anchors

[1]

Wv-mos: Mos score prediction by fine-tuned wav2vec 2.0

Pavel Andreev et al. Wv-mos: Mos score prediction by fine-tuned wav2vec 2.0. arXiv preprint arXiv:2203.13086.

[2]

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

[3]

Hi-fi multi-speaker english tts dataset

Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, and Yang Zhang. Hi-fi multi-speaker english tts dataset. arXiv preprint arXiv:2104.01497.

[4]

Frequency-domain enhanced extreme bandwidth extension network with iccrn for superior speech quality

Hongtao Bao and Xueliang Zhang. Frequency-domain enhanced extreme bandwidth extension network with iccrn for superior speech quality. In Proc. Interspeech 2025.

[5]

Cmgan: Conformer-based metric gan for speech enhancement

Ruizhe Cao, Sherif Abdulatif, and Bin Yang. Cmgan: Conformer-based metric gan for speech enhancement. arXiv preprint arXiv:2203.15149.

[6]

Schrodinger bridges beat diffusion models on text-to-speech synthesis

Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, and Jun Zhu. Schrodinger bridges beat diffusion models on text-to-speech synthesis. arXiv preprint arXiv:2312.03491.

[7]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759.

[8]

The design for the wall street journal-based

C Corpus. The design for the wall street journal-based. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, pp

[9]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.

[10]

FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840.

[11]

Real time speech enhancement in the waveform domain

Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi. Real time speech enhancement in the waveform domain. arXiv preprint arXiv:2006.12847.

[12]

Tutorial on variational autoencoders

Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[13]

Adversarial Audio Synthesis

Chris Donahue, Bo Li, and Rohit Prabhavalkar. Exploring speech enhancement with generative adversarial networks for robust speech recognition. In ICASSP. IEEE, 2018a. Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018b. Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super...

[14]

GANSynth: Adversarial Neural Audio Synthesis

Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. Gansynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710.

[15]

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, and Wei Ping. Audio flamingo next: Next-generation open audio-language models for...

[16]

Moss-audio-tokenizer: Scaling audio tokenizers for future audio foundation models

Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, and Xipeng Qiu. Moss-audio-tokenizer: Scaling audio tokenizers for future audio foundation models. arXiv preprint arXiv:2602.10934.

[17]

Multi-scale sub-band constant-q transform discriminator for high-fidelity vocoder

Yicheng Gu, Xueyao Zhang, Liumeng Xue, and Zhizheng Wu. Multi-scale sub-band constant-q transform discriminator for high-fidelity vocoder. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10616–10620. IEEE.

[18]

Nu-wave 2: A general neural audio upsampling model for various sampling rates

Seungu Han and Junhyeok Lee. Nu-wave 2: A general neural audio upsampling model for various sampling rates. arXiv preprint arXiv:2206.08545.

[19]

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the maestro dataset. arXiv preprint arXiv:1810.12247.

[20]

Visqol: an objective speech quality model

Andrew Hines, Jan Skoglund, Anil C Kokaram, and Naomi Harte. Visqol: an objective speech quality model. EURASIP Journal on Audio, Speech, and Music Processing, 2015(1):13.

[21]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

[22]

Towards real-time generative speech restoration with flow-matching

Tsun-An Hsieh and Sebastian Braun. Towards real-time generative speech restoration with flow-matching. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 15847–15851. IEEE.

[23]

Dccrn: Deep complex convolution recurrent network for phase-aware speech enhancement

Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie. Dccrn: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv preprint arXiv:2008.00264.

[24]

Saga-sr: Semantically and acoustically guided audio super-resolution

Jaekwon Im and Juhan Nam. Saga-sr: Semantically and acoustically guided audio super-resolution. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1706–1710. IEEE.

[25]

Univnet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation

Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. Univnet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. arXiv preprint arXiv:2106.07889.

[26]

Neural Machine Translation in Linear Time

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

[27]

Bandwidth Extension on Raw Audio via Generative Adversarial Networks

Donghyun Kim, Yungyeo Kim, and Joon-Hyuk Chang. Class: Continual learning approach for speech super-resolution. In ICASSP. IEEE, 2024a. Seung-Bin Kim, Sang-Hoon Lee, Ha-Yeong Choi, and Seong-Whan Lee. Audio super-resolution with robust speech representation learning of masked autoencoder. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:1...

[28]

Decoupling magnitude and phase estimation with deep resunet for music source separation

Qiuqiang Kong, Yin Cao, Haohe Liu, Keunwoo Choi, and Yuxuan Wang. Decoupling magnitude and phase estimation with deep resunet for music source separation. arXiv preprint arXiv:2109.05418, 2021a. Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, et al. Token reduction should go beyond efficiency in generative models – from vision, language to multimodality. arXiv preprint ar...

[29]

Kumar, M

Kundan Kumar, Rithesh Kumar, Thibault De Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre De Brebisson, Yoshua Bengio, and Aaron C Courville. Melgan: Generative adversarial networks for conditional waveform synthesis. CVPR, 2019a. Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. V...

[30]

Fastwave: Optimized diffusion model for audio super-resolution

Nikita Kuznetsov and Maksim Kaledin. Fastwave: Optimized diffusion model for audio super-resolution. arXiv preprint arXiv:2603.04122.

[31]

Bigvgan: A universal neural vocoder with large-scale training

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. Bigvgan: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658.

[32]

Semamba++: A general speech restoration framework leveraging global, local, and periodic spectral patterns

Yongjoon Lee and Jung-Woo Choi. Semamba++: A general speech restoration framework leveraging global, local, and periodic spectral patterns. arXiv preprint arXiv:2603.11669.

[33]

Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration

Jean-Marie Lemercier, Julius Richter, Simon Welker, and Timo Gerkmann. Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE.

[34]

A survey of the schrödinger problem and some of its connections with optimal transport

Christian Léonard. A survey of the schrödinger problem and some of its connections with optimal transport. arXiv preprint arXiv:1308.0215.

[35]

Audio super-resolution with latent bridge models

Chang Li, Zehua Chen, Fan Bao, and Jun Zhu. Bridge-sr: Schrödinger bridge for efficient sr. In ICASSP. IEEE, 2025a. Chang Li, Zehua Chen, Liyuan Wang, and Jun Zhu. Audio super-resolution with latent bridge models. arXiv preprint arXiv:2509.17609, 2025b. Changtao Li, Feiran Yang, and Jun Yang. Restoration of bone-conducted speech with u-net-like model and...

[36]

A two-stage approach to speech bandwidth extension

Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, and Christian Fuegen. A two-stage approach to speech bandwidth extension. In Interspeech, volume 2021, pp. 1689–1693.

[37]

Swibe: A parameterized stochastic diffusion process for noise-robust bandwidth expansion

Yin-Tse Lin, Shreya G Upadhyay, Bo-Hao Su, and Chi-Chun Lee. Swibe: A parameterized stochastic diffusion process for noise-robust bandwidth expansion. In Proc. Interspeech 2024.

[38]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[39]

Neural vocoder is all you need for speech super-resolution

Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, and DeLiang Wang. Neural vocoder is all you need for speech super-resolution. arXiv preprint arXiv:2203.14941, 2022a. Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, and Yuxuan Wang. Voicefixer: A unified framework for high-fidelity speech restoration. arXiv...

[40]

Audiosr: Versatile audio super-resolution at scale

Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, and Mark D Plumbley. Audiosr: Versatile audio super-resolution at scale. In ICASSP. IEEE, 2024a. Xi Liu, Mu Yang, Szu-Jui Chen, and John HL Hansen. A neural codec approach for noise-robust bandwidth expansion. In Proc. Interspeech 2025, 2025a. Xin Liu, Shulin He, and Xueliang Zhang. Hwb-net: A novel high-performan...

[41]

Mosnet: Deep learning based objective assessment for voice conversion

Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. Mosnet: Deep learning based objective assessment for voice conversion. arXiv preprint arXiv:1904.08352.

[42]

Towards high-quality and efficient speech bandwidth extension with parallel amplitude and phase prediction

Ye-Xin Lu, Yang Ai, Hui-Peng Du, and Zhen-Hua Ling. Towards high-quality and efficient speech bandwidth extension with parallel amplitude and phase prediction. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024a. Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, and Zhen-Hua Ling. Multi-stage speech bandwidth extension with flexible sampling rate con...

[43]

The song describer dataset: a corpus of audio captions for music-and-language evaluation

Ilaria Manco, Benno Weck, Seungheon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, et al. The song describer dataset: a corpus of audio captions for music-and-language evaluation. arXiv preprint arXiv:2311.10057.

[44]

Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebastian Möller. Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. arXiv preprint arXiv:2104.09494.

[45]

Chunked autoregressive gan for conditional waveform synthesis

Max Morrison, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, and Yoshua Bengio. Chunked autoregressive gan for conditional waveform synthesis. arXiv preprint arXiv:2110.10139.

[46]

Moisesdb: A dataset for source separation beyond 4-stems

Igor Pereira, Felipe Araújo, Filip Korzeniowski, and Richard Vogl. Moisesdb: A dataset for source separation beyond 4-stems. arXiv preprint arXiv:2307.15913.

[47]

Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

Chandan KA Reddy, Vishak Gopal, and Ross Cutler. Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6493–6497. IEEE.

[48]

Artificial bandwidth extension using a conditional generative adversarial network with discriminative training

Jonas Sautter, Friedrich Faubel, Markus Buck, and Gerhard Schmidt. Artificial bandwidth extension using a conditional generative adversarial network with discriminative training. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7005–7009. IEEE.

[49]

Universal score-based speech enhancement with high content preservation

Robin Scheibler, Yusuke Fujita, Yuma Shirahata, and Tatsuya Komatsu. Universal score-based speech enhancement with high content preservation. arXiv preprint arXiv:2406.12194.

[50]

Universal speech enhancement with score-based diffusion

Joan Serrà, Santiago Pascual, Jordi Pons, R Oguz Araz, and Davide Scaini. Universal speech enhancement with score-based diffusion. arXiv preprint arXiv:2206.03065.

[51]

mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra

Chenhao Shuai, Chaohua Shi, Lu Gan, and Hongqing Liu. mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra. arXiv preprint arXiv:2305.11104.

[52]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

[53]

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-u-net: A multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185.

[54]

Nldsi-bwe: Non linear dynamical systems-inspired multi resolution discriminators for speech bandwidth extension.arXiv preprint arXiv:2510.01109,

Tarikul Islam Tamiti and Anomadarshi Barua. Nldsi-bwe: Non linear dynamical systems-inspired multi resolution discriminators for speech bandwidth extension.arXiv preprint arXiv:2510.01109,

work page arXiv
[55]

A high-fidelity speech super resolution network using a complex global attention module with spectro-temporal loss.arXiv preprint arXiv:2507.00229,

Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan, Rashedul Hasan, Taieba Athay, Nursad Mamun, and Anomadarshi Barua. A high-fidelity speech super resolution network using a complex global attention module with spectro-temporal loss.arXiv preprint arXiv:2507.00229,

work page arXiv
[56]

A convolutional recurrent neural network for real-time speech enhancement

Ke Tan and DeLiang Wang. A convolutional recurrent neural network for real-time speech enhancement. In Interspeech, volume 2018, pp. 3229–3233,

work page 2018
[57]

Tfgan: Time and frequency domain based generative adversarial network for high-fidelity speech synthesis.arXiv preprint arXiv:2011.12206,

Qiao Tian, Yi Chen, Zewang Zhang, Heng Lu, Linghui Chen, Lei Xie, and Shan Liu. Tfgan: Time and frequency domain based generative adversarial network for high-fidelity speech synthesis.arXiv preprint arXiv:2011.12206,

work page arXiv 2011
[58]

Improving and generalizing flow-based generative models with minibatch optimal transport

Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport.arXiv preprint arXiv:2302.00482,

work page internal anchor Pith review Pith/arXiv arXiv
[59]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv
[60]

WaveNet: A Generative Model for Raw Audio

Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, et al. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 12:1,

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Diffusion- based speech enhancement with schrödinger odinger bridge and symmetric noise schedule.arXiv preprint arXiv:2409.05116, 2024a

Siyi Wang, Siyi Liu, Andrew Harper, Paul Kendrick, Mathieu Salzmann, and Milos Cernak. Diffusion- based speech enhancement with schrödinger odinger bridge and symmetric noise schedule.arXiv preprint arXiv:2409.05116, 2024a. Yingxue Wang, Shenghui Zhao, Wenbo Liu, Ming Li, and Jingming Kuang. Speech bandwidth expansion based on deep neural networks. InINTERSPEECH,

work page arXiv
[62]

Framebridge: Improving image-to-video generation with bridge models.arXiv preprint arXiv:2410.15371, 2024b

Yuji Wang, Zehua Chen, Xiaoyu Chen, Yixiang Wei, Jun Zhu, and Jianfei Chen. Framebridge: Improving image-to-video generation with bridge models.arXiv preprint arXiv:2410.15371, 2024b. Zixuan Wang, Jinghao Shi, Hanzhong Liang, Xiang Shen, Vera Wen, Zhiqian Chen, Yifan Wu, Zhixin Zhang, and Hongyu Xiong. Filter-and-refine: A mllm based cascade system for in...

work page arXiv
[63]

Are we using enough listeners? no! an empirically-supported critique of interspeech 2014 tts evaluations

Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter. Are we using enough listeners? no! an empirically-supported critique of interspeech 2014 tts evaluations. InInterspeech

work page 2014
[64]

Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. Step-audio 2 technical report.arXiv preprint arXiv:2507.16632,

work page internal anchor Pith review Pith/arXiv arXiv
[65]

The cham- ber ensemble generator: Limitless high-quality mir data via generative modeling.arXiv preprint arXiv:2209.14458,

Yusong Wu, Josh Gardner, Ethan Manilow, Ian Simon, Curtis Hawthorne, and Jesse Engel. The cham- ber ensemble generator: Limitless high-quality mir data via generative modeling.arXiv preprint arXiv:2209.14458,

work page arXiv
[66]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

work page internal anchor Pith review Pith/arXiv arXiv
[67]

Swinsrgan: Swin transformer-based generative adversarial network for high-fidelity speech super-resolution.arXiv preprint arXiv:2509.03913,

Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, et al. Swinsrgan: Swin transformer-based generative adversarial network for high-fidelity speech super-resolution.arXiv preprint arXiv:2509.03913,

work page arXiv
[68]

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. Libritts: A corpus derived from librispeech for text-to-speech.arXiv preprint arXiv:1904.02882,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[69]

Codecflow: Effi- cient bandwidth extension via conditional flow matching in neural codec latent space.arXiv preprint arXiv:2603.02022,

Bowen Zhang, Junchuan Zhao, Ian McLoughlin, Ye Wang, and A S Madhukumar. Codecflow: Effi- cient bandwidth extension via conditional flow matching in neural codec latent space.arXiv preprint arXiv:2603.02022,

work page arXiv
[70]

Wsrglow: A glow-based waveform generative model for audio super-resolution.arXiv preprint arXiv:2106.08507,

Kexun Zhang, Yi Ren, Changliang Xu, and Zhou Zhao. Wsrglow: A glow-based waveform generative model for audio super-resolution.arXiv preprint arXiv:2106.08507,

work page arXiv
[71]

Urgent challenge: Universality, robustness, and generalizability for speech enhancement.arXiv preprint arXiv:2406.04660,

Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, et al. Urgent challenge: Universality, robustness, and generalizability for speech enhancement.arXiv preprint arXiv:2406.04660,

work page arXiv
[72]

arXiv preprint arXiv:2309.16948 (2023)

Linqi Zhou, Aaron Lou, Samar Khanna, and Stefano Ermon. Denoising diffusion bridge models.arXiv preprint arXiv:2309.16948,

work page arXiv
