Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

Hui-Peng Du; Xiao-Hang Jiang; Yang Ai; Yuan Tian; Zhen-Hua Ling

arxiv: 2605.25669 · v1 · pith:VEM2BCXVnew · submitted 2026-05-25 · 📡 eess.AS


Pith reviewed 2026-06-29 19:42 UTC · model grok-4.3

classification 📡 eess.AS

keywords neural speech coding · ultra-low bitrate · mel-spectrogram · flow matching · vector quantization · vocoder · speech reconstruction

The pith

FMelCodec reconstructs natural speech at 250 bps by coding mel-spectrograms with a single VQ codebook, refining them via conditional flow matching, and vocoding the result.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FMelCodec as a three-stage neural codec that works directly on mel-spectrograms to reach ultra-low bitrates. The front-end compresses by a factor of 640 using one 1024-entry vector-quantization codebook plus online clustering to avoid collapse. A lightweight conditional flow-matching stage then refines the degraded spectrogram, after which a HiFi-GAN vocoder produces the final waveform. Experiments at 250 bps for 16 kHz and 750 bps for 48 kHz report higher objective scores, better listener ratings, and greater speaker similarity than prior methods, all with reduced model size and compute.
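The bitrate figures follow from simple arithmetic, sketched below. The sketch assumes that "a factor of 640" means one codebook index per 640 waveform samples; this reading is an assumption, although it matches the 25 codes per second mentioned later on this page. The function name `codec_bitrate` is illustrative and does not come from the paper.

```python
import math


def codec_bitrate(sample_rate_hz: int, samples_per_code: int, codebook_size: int) -> float:
    """Bits per second for a codec that emits one codebook index per frame."""
    codes_per_second = sample_rate_hz / samples_per_code
    bits_per_code = math.log2(codebook_size)  # 1024 entries -> 10 bits
    return codes_per_second * bits_per_code


# Assumption: 640 waveform samples per code at both sampling rates.
print(codec_bitrate(16_000, 640, 1024))  # 250.0
print(codec_bitrate(48_000, 640, 1024))  # 750.0
```

Under this reading, the same 640-sample stride and 10-bit index account for both reported operating points.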

Core claim

The coding-refinement-reconstruction (CRR) framework combines three stages: mel-spectrogram coding with aggressive VQ and online clustering, CFM-based refinement via a velocity-field estimator trained with self-consistency, and vocoder-driven waveform reconstruction. The paper claims that it yields higher speech quality and higher speaker similarity at 250 bps for 16 kHz and 750 bps for 48 kHz, with lower computational and model complexity than existing ultra-low-bitrate codecs.

What carries the argument

The conditional flow matching refinement stage, which estimates a velocity field to correct the mel-spectrogram output by the vector-quantized decoder before vocoding.
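As a rough illustration of how a velocity field can drive refinement, the sketch below integrates dx/dt = v(x, t, cond) with a fixed-step Euler solver. Everything in it is an assumption rather than the paper's implementation: the starting point (Gaussian noise), the solver, the mel shape, and the toy velocity function, which cheats by knowing the clean target and only stands in for a trained estimator.

```python
import numpy as np


def euler_refine(x_start, cond, velocity_fn, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = np.asarray(x_start, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x


def make_toy_velocity(clean_mel):
    """Hypothetical stand-in for a trained estimator (it is given the clean target)."""
    def velocity(x, t, cond):
        # Exact velocity of a straight path that reaches clean_mel at t=1.
        # `cond` is ignored here; a real estimator would depend on it.
        return (clean_mel - x) / (1.0 - t)
    return velocity


rng = np.random.default_rng(0)
clean_mel = rng.normal(size=(80, 25))            # hypothetical 80-bin mel, 25 frames
degraded_mel = clean_mel + 0.5 * rng.normal(size=clean_mel.shape)
noise = rng.normal(size=clean_mel.shape)

refined = euler_refine(noise, degraded_mel, make_toy_velocity(clean_mel), num_steps=4)
print(np.abs(refined - clean_mel).max())         # close to zero for this toy field
```

The `num_steps` argument is the knob that training for fewer inference steps would aim to reduce; with a learned field, fewer steps normally trade accuracy for speed.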

If this is right

Speech remains intelligible and speaker-consistent at bitrates well below those of conventional codecs.
The refinement stage improves both quality metrics and speaker similarity without large added cost.
Model size and inference compute stay lower than competing neural codecs at the same bitrate.
The same pipeline works for both 16 kHz and 48 kHz sampling rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-consistency training that allows fewer flow-matching steps could lower latency for real-time applications.
If the clustering trick generalizes, similar single-codebook designs might help other extreme-compression generative tasks.
The mel-spectrogram domain plus refinement might transfer to coding music or environmental audio at comparable bitrates.

Load-bearing premise

The online clustering strategy prevents codebook collapse and keeps the 1024-entry codebook diverse enough for usable reconstruction at the 640x compression ratio.
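To make the premise concrete, here is a minimal sketch of one way underused codewords can be reassigned during training. This page does not give the paper's reassignment rule, so the rule below is an assumption: usage is tracked with an exponential moving average, and codewords whose usage falls under a threshold are reset to randomly chosen encoder outputs. The names `nearest_code` and `reassign_underused` are illustrative.

```python
import numpy as np


def nearest_code(z, codebook):
    """Index of the nearest codeword (squared Euclidean distance) for each row of z."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def reassign_underused(codebook, usage, z, indices, decay=0.99, threshold=1e-3, rng=None):
    """Update usage statistics and reset rarely used codewords to encoder outputs."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.bincount(indices, minlength=len(codebook))
    usage = decay * usage + (1.0 - decay) * counts / len(indices)
    dead = np.flatnonzero(usage < threshold)
    codebook = codebook.copy()
    if dead.size > 0:
        # Assumed rule: move each dead codeword onto a random encoder output.
        codebook[dead] = z[rng.integers(0, len(z), size=dead.size)]
        usage[dead] = 1.0 / len(codebook)  # give reset codewords a fresh start
    return codebook, usage


rng = np.random.default_rng(0)
z = rng.normal(size=(256, 4))        # toy encoder outputs
codebook = np.full((8, 4), 100.0)    # collapsed codebook: 7 codewords far from the data
codebook[0] = 0.0
usage = np.full(8, 1.0 / 8)

before = nearest_code(z, codebook)
codebook, usage = reassign_underused(codebook, usage, z, before, decay=0.0, rng=rng)
after = nearest_code(z, codebook)
print(len(np.unique(before)), len(np.unique(after)))  # one code used before, more after
```

The toy example uses `decay=0.0` so that a single batch is enough to expose the dead codewords; a slower decay would be the more typical choice during real training.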

What would settle it

Subjective listening tests at 250 bps in which FMelCodec receives equal or lower mean opinion scores and speaker similarity ratings than the strongest baseline codec.

Figures

Figures reproduced from arXiv: 2605.25669 by Hui-Peng Du, Xiao-Hang Jiang, Yang Ai, Yuan Tian, Zhen-Hua Ling.

**Figure 2.** Architecture of the ConvNeXt v2–based mel-spectrogram encoder and decoder used in the mel-spectrogram coding stage.

**Figure 4.** Architecture of the velocity field estimation network.

**Figure 5.** Spectrogram visualizations of natural speech, speech decoded by FMelCodec and the baseline methods, as well as speech reconstructed via HiFi-GAN.

**Figure 6.** ABX preference results on the 16 kHz LibriTTS test set, comparing …

**Figure 7.** Visualization of the natural mel-spectrogram f…

**Figure 8.** ABX preference results on the 16 kHz LibriTTS test set, comparing FMelCodec at 250 bps with its ablated variants. “N/P” denotes no preference and p indicates the p-value of a paired t-test used for statistical analysis.

Original abstract

Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, we propose FMelCodec, an ultra-low-bitrate neural speech codec in the mel-spectrogram domain, cast as a three-stage coding-refinement-reconstruction (CRR) framework that can operate at as low as 250 bps. In the CRR framework, the front-end mel-spectrogram coding stage employs a highly aggressive 640x compression/decompression encoder-decoder structure with a single 1024-entry VQ codebook, coupled with an online clustering strategy that reassigns underused codewords to prevent codebook collapse and preserve codebook diversity. The subsequent conditional flow matching (CFM)-based mel-spectrogram refinement stage leverages a lightweight velocity-field estimator and CFM-based solver to refine the codec-degraded mel-spectrogram produced by the preceding decoder, and adopts a self-consistency training scheme that supports fewer iterative inference steps for the purpose of reducing computational overhead. Finally, the vocoding-driven waveform reconstruction stage employs a HiFi-GAN vocoder to faithfully reconstruct waveform from the refined mel-spectrogram. Experiments conducted on two datasets spanning two sampling rates show that, under ultra-low-bitrate constraints of 250 bps for 16 kHz and 750 bps for 48 kHz, both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity, while incurring lower computational and model complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FMelCodec's three-stage mel-spectrogram pipeline with online VQ clustering, conditional flow matching, and HiFi-GAN is a sensible engineering stack for 250 bps speech, but the abstract supplies zero metrics so the performance claims and VQ stability cannot be checked.

The paper puts forward FMelCodec as a CRR setup: heavy 640x mel-spectrogram compression via a single 1024-entry VQ with online clustering to stop collapse, followed by a lightweight conditional flow matching refiner trained with self-consistency, then a HiFi-GAN vocoder. It targets 250 bps at 16 kHz and 750 bps at 48 kHz and claims better reconstruction quality plus speaker similarity than prior work, plus lower complexity.

The combination itself looks like the main new piece. Pairing aggressive single-codebook VQ with a flow-matching cleanup stage and self-consistency training to cut inference steps is not a direct copy of the cited prior art. The online reassignment trick for the codebook is a practical detail that addresses a known VQ failure mode at extreme ratios.

The approach makes engineering sense. After such heavy quantization the mel-spectrogram is badly damaged, so inserting a learned refiner before the vocoder is a logical way to try to restore naturalness without raising the bitrate.

The clear weakness is the complete absence of numbers. No PESQ, STOI, MOS, speaker similarity scores, no baseline comparisons, no dataset descriptions, and no mention of statistical tests appear in the abstract. Without those the central claim stays untestable. The stress-test point about codebook diversity at 25 codes per second is on target; if the online clustering fails to keep entropy up, the CFM stage has little usable signal to refine and the whole bitrate-quality story falls apart. The full paper will have to demonstrate that the 1024-entry codebook actually stays diverse.
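The diversity check asked for here is cheap to run once code indices are available. The sketch below computes codeword utilization, empirical entropy, and perplexity from a stream of indices; the function name `codebook_stats` and the two synthetic index streams are illustrative assumptions, not data from the paper.

```python
import numpy as np


def codebook_stats(indices, codebook_size=1024):
    """Utilization, entropy (bits) and perplexity of a stream of code indices."""
    counts = np.bincount(np.asarray(indices), minlength=codebook_size)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    entropy_bits = float(-(nonzero * np.log2(nonzero)).sum())
    return {
        "utilization": float((counts > 0).mean()),  # fraction of codewords ever used
        "entropy_bits": entropy_bits,               # at most log2(codebook_size)
        "perplexity": float(2.0 ** entropy_bits),   # effective number of codewords
    }


uniform = codebook_stats(np.arange(1024))               # every codeword used once
collapsed = codebook_stats(np.zeros(1000, dtype=int))   # a single codeword used
print(uniform)
print(collapsed)
```

A healthy 1024-entry codebook would sit near 10 bits of entropy, while a collapsed one approaches 0 bits, at which point the nominal 250 bps carries far less information than the bitrate suggests.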

This is for people working on neural speech codecs for bandwidth-limited links. A reader already building or evaluating low-bitrate systems would get value from the pipeline description and the self-consistency trick once the results are shown.

It deserves a serious referee. The framework is concrete enough to review on its own terms, and the missing experimental details are exactly what referees can demand.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes FMelCodec, a three-stage CRR neural speech codec operating at 250 bps (16 kHz) and 750 bps (48 kHz). The front-end uses a 640x-compressed mel-spectrogram encoder-decoder with a single 1024-entry VQ codebook plus online clustering to avoid collapse; this is followed by a lightweight conditional flow-matching refinement stage with self-consistency training and a final HiFi-GAN vocoder. The central claim is that the system yields higher objective and subjective reconstruction quality and speaker similarity than prior methods while using lower model and computational complexity, validated on two datasets at the two sampling rates.

Significance. If the performance claims and the effectiveness of the online clustering at 640x compression are substantiated, the work would provide a concrete demonstration that aggressive single-codebook VQ plus flow-matching refinement can sustain usable naturalness and identity at bitrates previously considered marginal, with potential relevance to bandwidth-constrained speech transmission.

major comments (2)

[front-end mel-spectrogram coding stage] Abstract and front-end mel-spectrogram coding stage: the assertion that the online clustering strategy 'prevents codebook collapse and preserve[s] codebook diversity' at the 640x compression ratio (25 codes/sec from a 1024-entry codebook) is presented without any supporting measurements (codeword utilization histograms, entropy statistics, or ablation removing the reassignment). This premise is load-bearing for the 250 bps claim, because collapse would render the subsequent CFM and HiFi-GAN stages unable to recover the reported quality.
[Abstract] Abstract: the claim that 'both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity' is stated without any numerical values, baseline names, dataset sizes, or significance tests. The absence of these data prevents independent assessment of whether the reported gains are real or merely consistent with the weakest-assumption premise above.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments below and will make the necessary revisions to strengthen the presentation of our claims.

Referee: [front-end mel-spectrogram coding stage] Abstract and front-end mel-spectrogram coding stage: the assertion that the online clustering strategy 'prevents codebook collapse and preserve[s] codebook diversity' at the 640x compression ratio (25 codes/sec from a 1024-entry codebook) is presented without any supporting measurements (codeword utilization histograms, entropy statistics, or ablation removing the reassignment). This premise is load-bearing for the 250 bps claim, because collapse would render the subsequent CFM and HiFi-GAN stages unable to recover the reported quality.

Authors: We agree that the manuscript as submitted does not include explicit supporting measurements (such as codeword utilization histograms, entropy statistics, or an ablation removing the reassignment) for the online clustering strategy. This is a valid concern given the aggressive compression ratio. In the revised manuscript we will add these analyses in the front-end coding stage section to substantiate the claim that the strategy prevents collapse and maintains diversity. revision: yes
Referee: [Abstract] Abstract: the claim that 'both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity' is stated without any numerical values, baseline names, dataset sizes, or significance tests. The absence of these data prevents independent assessment of whether the reported gains are real or merely consistent with the weakest-assumption premise above.

Authors: The abstract is written to remain concise while summarizing the key contributions and outcomes. We acknowledge that the current wording lacks specific numerical values, baseline names, and dataset details. In the revision we will incorporate a small number of representative objective metrics and baseline references into the abstract (subject to length constraints) and will explicitly direct readers to the experimental section for full tables, dataset sizes, and any statistical significance tests performed. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluations

The paper describes a three-stage CRR framework (mel-spectrogram VQ coding with online clustering, CFM refinement, HiFi-GAN reconstruction) and supports its performance claims solely through objective/subjective experiments on two datasets at 250/750 bps. No equations, derivations, or self-citations are presented that reduce any reported gain to a fitted parameter, self-definition, or prior author result by construction. The online clustering strategy is introduced as an engineering choice whose success is asserted via experimental outcomes rather than mathematical equivalence to the input data or model. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on several design choices and domain assumptions extracted from the abstract; no new physical entities are introduced.

free parameters (2)

VQ codebook size
Single 1024-entry codebook chosen for the coding stage to achieve 640x compression at target bitrates.
Compression ratio
Highly aggressive 640x encoder-decoder compression ratio is a fixed architectural choice.

axioms (2)

domain assumption: Conditional flow matching with self-consistency training can refine codec-degraded mel-spectrograms while supporting reduced inference steps.
Invoked to justify the refinement stage and its computational benefit.
domain assumption: Online clustering prevents codebook collapse under extreme quantization.
Stated as the mechanism preserving codebook diversity in the coding stage.
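For orientation, the first domain assumption above builds on conditional flow matching. A common form of the objective from the flow-matching literature is written out below. Whether the paper uses exactly this probability path, and how its self-consistency term is defined, is not stated on this page, so the formula is a reference point rather than the authors' loss. The symbols are assumptions: $x_0$ is Gaussian noise, $x_1$ the natural mel-spectrogram, $c$ the codec-degraded mel-spectrogram used as the condition, and $\sigma_{\min}$ a small constant.

```latex
\begin{align}
x_t &= \bigl(1 - (1 - \sigma_{\min})\,t\bigr)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1], \\
u_t &= x_1 - (1 - \sigma_{\min})\,x_0, \\
\mathcal{L}_{\mathrm{CFM}}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\lVert v_\theta(x_t, t, c) - u_t \bigr\rVert^2 .
\end{align}
```

At inference, the learned $v_\theta$ is integrated from $t = 0$ to $t = 1$, and the number of solver steps determines the compute cost that training for fewer steps is meant to reduce.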

pith-pipeline@v0.9.1-grok · 5844 in / 1410 out tokens · 36175 ms · 2026-06-29T19:42:36.031518+00:00 · methodology

[1] [1]

High-quality, low-delay music coding in the opus codec,

J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. V os, “High-quality, low-delay music coding in the opus codec,” inAudio Engineering Society Convention 135. Audio Engineering Society, 2013

2013

[2] [2]

Overview of the EVS codec architecture,

M. Dietz, M. Multrus, V . Eksler, V . Malenovsky, E. Norvell, H. Pobloth, L. Miao, Z. Wang, L. Laaksonen, A. Vasilacheet al., “Overview of the EVS codec architecture,” inProc. ICASSP, 2015, pp. 5698–5702

2015

[3] [3]

Generative speech coding with predictive variance regularization,

W. B. Kleijn, A. Storus, M. Chinen, T. Denton, F. S. Lim, A. Luebs, J. Skoglund, and H. Yeh, “Generative speech coding with predictive variance regularization,” inProc. ICASSP, 2021, pp. 6478–6482

2021

[4] [4]

Universal discrete-domain speech enhancement,

F. Liu, Y . Ai, Y .-X. Lu, R.-C. Zheng, H.-P. Du, and Z.-H. Ling, “Universal discrete-domain speech enhancement,”IEEE Transactions on Audio, Speech and Language Processing, vol. 34, pp. 285–298, 2026

2026

[5] [5]

Speech enhancement using continuous embeddings of neural audio codec,

H. Li, J. Q. Yip, T. Fan, and E. S. Chng, “Speech enhancement using continuous embeddings of neural audio codec,” inProc. ICASSP, 2025, pp. 1–5

2025

[6] [6]

Neural codec language models are zero-shot text to speech synthesizers,

S. Chen, C. Wang, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, “Neural codec language models are zero-shot text to speech synthesizers,”IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 705–718, 2025

2025

[7] [7]

V ALL-E2: Neural codec language models are human parity zero-shot text to speech synthesizers,

S. Chen, S. Liu, L. Zhou, Y . Liu, X. Tan, J. Li, S. Zhao, Y . Qian, and F. Wei, “V ALL-E2: Neural codec language models are human parity zero-shot text to speech synthesizers,”arXiv preprint arXiv:2406.05370, 2024

work page arXiv 2024

[8]

Moshi: a speech-text foundation model for real-time dialogue

A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024

2024

[9]

SoundStream: An End-to-End Neural Audio Codec,

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022

2022

[10]

High fidelity neural audio compression,

A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023

2023

[11]

Generative adversarial nets,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NeurIPS, vol. 27, 2014

2014

[12]

High-fidelity audio compression with improved RVQGAN,

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Proc. NeurIPS, vol. 36, 2024

2024

[13]

BigCodec: Pushing the limits of low-bitrate neural speech codec,

D. Xin, X. Tan, S. Takamichi, and H. Saruwatari, “BigCodec: Pushing the limits of low-bitrate neural speech codec,” arXiv preprint arXiv:2409.05377, 2024

2024

[14]

APCodec: A neural audio codec with parallel amplitude and phase spectrum encoding and decoding,

Y. Ai, X.-H. Jiang, Y.-X. Lu, H.-P. Du, and Z.-H. Ling, “APCodec: A neural audio codec with parallel amplitude and phase spectrum encoding and decoding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3256–3269, 2024

2024

[15]

MDCTCodec: A lightweight MDCT-based neural audio codec towards high sampling rate and low bitrate scenarios,

X.-H. Jiang, Y. Ai, R.-C. Zheng, H.-P. Du, Y.-X. Lu, and Z.-H. Ling, “MDCTCodec: A lightweight MDCT-based neural audio codec towards high sampling rate and low bitrate scenarios,” in Proc. SLT, 2024, pp. 540–547

2024

[16]

A streamable neural audio codec with residual scalar-vector quantization for real-time communication,

X.-H. Jiang, Y. Ai, R.-C. Zheng, and Z.-H. Ling, “A streamable neural audio codec with residual scalar-vector quantization for real-time communication,” IEEE Signal Processing Letters, vol. 32, pp. 1645–1649, 2025

2025

[17]

Spectral codecs: Spectrogram-based audio codecs for high quality speech synthesis,

R. Langman, A. Jukić, K. Dhawan, N. R. Koluguri, and B. Ginsburg, “Spectral codecs: Spectrogram-based audio codecs for high quality speech synthesis,” arXiv preprint arXiv:2406.05298, 2024

2024

[18]

WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,

S. Ji, Z. Jiang, W. Wang, Y. Chen, M. Fang, J. Zuo, Q. Yang, X. Cheng, Z. Wang, R. Li et al., “WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,” in Proc. ICLR, 2025

2025

[19]

Scaling transformers for low-bitrate high-quality speech coding,

J. D. Parker, A. Smirnov, J. Pons, C. Carr, Z. Zukowski, Z. Evans, and X. Liu, “Scaling transformers for low-bitrate high-quality speech coding,” in The Thirteenth International Conference on Learning Representations

[20]

TS3-Codec: Transformer-based simple streaming single codec,

H. Wu, N. Kanda, S. Emre Eskimez, and J. Li, “TS3-Codec: Transformer-based simple streaming single codec,” in Proc. Interspeech, 2025, pp. 604–608

2025

[21]

FocalCodec: Low-bitrate speech coding via focal modulation networks,

L. Della Libera, F. Paissan, C. Subakan, and M. Ravanelli, “FocalCodec: Low-bitrate speech coding via focal modulation networks,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems

[22]

Semanticodec: An ultra low bitrate semantic audio codec for general sound,

H. Liu, X. Xu, Y. Yuan, M. Wu, W. Wang, and M. D. Plumbley, “Semanticodec: An ultra low bitrate semantic audio codec for general sound,” IEEE Journal of Selected Topics in Signal Processing, vol. 18, no. 8, pp. 1448–1461, 2024

2024

[23]

Flow matching for generative modeling,

Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in Proc. ICLR, 2023

2023

[24]

FlowMAC: Conditional flow matching for audio coding at low bit rates,

N. Pia, M. Strauss, M. Multrus, and B. Edler, “FlowMAC: Conditional flow matching for audio coding at low bit rates,” in Proc. ICASSP, 2025, pp. 1–5

2025

[25]

HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,

J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, vol. 33, 2020, pp. 17 022–17 033

2020

[26]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

2020

[27]

Generative de-quantization for neural speech codec via latent diffusion,

H. Yang, I. Jang, and M. Kim, “Generative de-quantization for neural speech codec via latent diffusion,” in Proc. ICASSP. IEEE, 2024, pp. 1251–1255

2024

[28]

From discrete tokens to high-fidelity audio using multi-band diffusion,

R. San Roman, Y. Adi, A. Deleforge, R. Serizel, G. Synnaeve, and A. Défossez, “From discrete tokens to high-fidelity audio using multi-band diffusion,” Advances in neural information processing systems, vol. 36, pp. 1526–1538, 2023

2023

[29]

Matcha-TTS: A fast TTS architecture with conditional flow matching,

S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, “Matcha-TTS: A fast TTS architecture with conditional flow matching,” in Proc. ICASSP, 2024, pp. 11 341–11 345

2024

[30]

FlowSE: Efficient and high-quality speech enhancement via flow matching,

Z. Wang, Z. Liu, X. Zhu, Y. Zhu, M. Liu, J. Chen, L. Xiao, C. Weng, and L. Xie, “FlowSE: Efficient and high-quality speech enhancement via flow matching,” in Proc. Interspeech, 2025, pp. 4858–4862

2025

[31]

FlowDec: A flow-based full-band general audio codec with high perceptual quality,

S. Welker, M. Le, R. T. Chen, W.-N. Hsu, T. Gerkmann, A. Richard, and Y.-C. Wu, “FlowDec: A flow-based full-band general audio codec with high perceptual quality,” in Proc. ICLR, 2025

2025

[32]

Mucodec: Ultra low-bitrate music codec,

Y. Xu, H. Chen, J. Yu, W. Tan, R. Gu, S. Lei, Z. Lin, and Z. Wu, “Mucodec: Ultra low-bitrate music codec,” arXiv preprint arXiv:2409.13216, 2024

2024

[33]

ConvNeXt v2: Co-designing and scaling convnets with masked autoencoders,

S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “ConvNeXt v2: Co-designing and scaling convnets with masked autoencoders,” in Proc. CVPR, 2023, pp. 16 133–16 142

2023

[34]

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter,

Y. Lv, H. Li, Y. Yan, J. Liu, D. Xie, and L. Xie, “FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter,” in Proc. Interspeech, 2024, pp. 3869–3873

2024

[35]

APNet2: High-quality and high-efficiency neural vocoder with direct prediction of amplitude and phase spectra,

H.-P. Du, Y.-X. Lu, Y. Ai, and Z.-H. Ling, “APNet2: High-quality and high-efficiency neural vocoder with direct prediction of amplitude and phase spectra,” in Proc. NCMMSC, 2023, pp. 66–80

2023

[36]

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis,

H. Siuzdak, “Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis,” in Proc. ICLR, 2024

2024

[37]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

2020

[38]

Online clustered codebook,

C. Zheng and A. Vedaldi, “Online clustered codebook,” in Proc. ICCV, 2023, pp. 22 798–22 807

2023

[39]

ERVQ: Enhanced residual vector quantization with intra-and-inter-codebook optimization for neural audio codecs,

R.-C. Zheng, H.-P. Du, X.-H. Jiang, Y. Ai, and Z.-H. Ling, “ERVQ: Enhanced residual vector quantization with intra-and-inter-codebook optimization for neural audio codecs,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 2539–2550, 2025

2025

[40]

Single-codec: Single-codebook speech codec towards high-performance speech generation,

H. Li, L. Xue, H. Guo, X. Zhu, Y. Lv, L. Xie, Y. Chen, H. Yin, and Z. Li, “Single-codec: Single-codebook speech codec towards high-performance speech generation,” in Proc. Interspeech, 2024, pp. 3390–3394

2024

[41]

Neural discrete representation learning,

A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017

2017

[42]

Generating diverse high-fidelity images with VQ-VAE-2,

A. Razavi, A. Van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with VQ-VAE-2,” Advances in neural information processing systems, vol. 32, 2019

2019

[43]

Identity mappings in deep residual networks,

K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” Computer Vision–ECCV 2016, vol. 9908, pp. 630–645, 2016

2016

[44]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

2017

[45]

BigVGAN: A universal neural vocoder with large-scale training,

S. G. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” in Proc. ICLR, 2023

2023

[46]

Grad-TTS: A diffusion probabilistic model for text-to-speech,

V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, “Grad-TTS: A diffusion probabilistic model for text-to-speech,” in Proc. ICML, 2021, pp. 8599–8608

2021

[47]

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019

2019

[48]

CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),

J. Yamagishi, C. Veaux, K. MacDonald et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), pp. 271–350, 2019

2019

[49]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2017

2017

[50]

WavLM: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022

[51]

ViSQOL v3: An open source production ready objective speech and audio metric,

M. Chinen, F. S. Lim, J. Skoglund, N. Gureev, F. O’Gorman, and A. Hines, “ViSQOL v3: An open source production ready objective speech and audio metric,” in Proc. QoMEX, 2020, pp. 1–6

2020

[52]

UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,

T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525

2022

[53]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023, pp. 28 492–28 518

2023