WaveNeXt 2: ConvNeXt-Based Fast Neural Vocoders With Residual Denoising and Sub-Modeling for GAN and Diffusion Models

Hisashi Kawai; Sakriani Sakti; Takuma Okamoto; Wangzixi Zhou; Yamato Ohtani

arxiv: 2605.25506 · v1 · pith:OR2KUB7Inew · submitted 2026-05-25 · 📡 eess.AS


Wangzixi Zhou, Takuma Okamoto, Yamato Ohtani, Sakriani Sakti, Hisashi Kawai

classification 📡 eess.AS

keywords convnext-based · diffusion · faster · models · vocoders · wavenext · denoising · diff-wavenext


Most neural vocoders are tied to a single framework, either GAN-based or diffusion-based. While state-of-the-art models like Vocos and WaveNeXt use powerful ConvNeXt-based generators, they have only been used in GAN frameworks and have limited performance in multi-speaker settings. Moreover, diffusion models, despite training faster than GANs, have slow CPU inference. In this paper, we introduce WaveNeXt 2, a unified ConvNeXt-based framework compatible with both GAN and diffusion vocoders. Its core innovation is residual denoising and sub-modeling, where each sub-model progressively refines the waveform. Experimental results on a multi-speaker dataset demonstrate the effectiveness of our approach: (1) GAN-WaveNeXt 2 is much faster than HiFi-GAN and WaveFit, and (2) Diff-WaveNeXt 2 also delivers much faster inference and competitive synthesis quality compared with FastDiff with 4 steps. Diff-WaveNeXt 2 is also training-efficient: it trains in only 32 hours, which makes it well suited to resource-constrained applications.
