
arxiv: 2605.25669 · v1 · pith:VEM2BCXVnew · submitted 2026-05-25 · 📡 eess.AS

Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

Pith reviewed 2026-06-29 19:42 UTC · model grok-4.3

classification 📡 eess.AS
keywords neural speech coding · ultra-low bitrate · mel-spectrogram · flow matching · vector quantization · vocoder · speech reconstruction

The pith

FMelCodec reconstructs natural speech at 250 bps by coding mel-spectrograms with a single VQ codebook, refining them via conditional flow matching, and vocoding the result.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FMelCodec as a three-stage neural codec that works directly on mel-spectrograms to reach ultra-low bitrates. The front-end compresses by a factor of 640 using one 1024-entry vector-quantization codebook plus online clustering to avoid collapse. A lightweight conditional flow-matching stage then refines the degraded spectrogram, after which a HiFi-GAN vocoder produces the final waveform. Experiments at 250 bps for 16 kHz and 750 bps for 48 kHz report higher objective scores, better listener ratings, and greater speaker similarity than prior methods, all with reduced model size and compute.
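
The two headline bitrates follow from simple arithmetic. The sketch below reproduces them under two assumptions that the summary does not spell out: that the 640x factor counts waveform samples per transmitted code, and that the same factor is used at both sampling rates. The function and parameter names are illustrative, not taken from the paper.

```python
import math

def bitrate_bps(sample_rate_hz: int, downsample_factor: int, codebook_size: int) -> float:
    """Bitrate of a single-codebook codec emitting one index per frame."""
    # Assumption: one code index is sent for every `downsample_factor` samples.
    tokens_per_second = sample_rate_hz / downsample_factor
    # Each index from a codebook of size K costs log2(K) bits.
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token

print(bitrate_bps(16_000, 640, 1024))  # 25 codes/s * 10 bits = 250.0
print(bitrate_bps(48_000, 640, 1024))  # 75 codes/s * 10 bits = 750.0
```

Both results match the bitrates quoted for 16 kHz and 48 kHz, which suggests the assumptions are at least consistent with the reported setup.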

Core claim

The CRR framework (mel-spectrogram coding with aggressive VQ and clustering, CFM-based refinement via a velocity-field estimator and self-consistency training, and vocoder-driven waveform reconstruction) produces higher-quality speech and speaker similarity at 250 bps for 16 kHz and 750 bps for 48 kHz while using lower computational and model complexity than existing ultra-low-bitrate codecs.

What carries the argument

The conditional flow matching refinement stage, which estimates a velocity field to correct the mel-spectrogram output by the vector-quantized decoder before vocoding.
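
As a rough illustration of what "estimating a velocity field" means at inference time, the sketch below integrates an ordinary differential equation with a few Euler steps. Everything in it is an assumption for illustration: the starting point (Gaussian noise), the 80-bin mel shape, the Euler solver, and above all the toy velocity function, which stands in for the learned estimator and simply points at a known clean target. It does not reproduce the paper's network or solver.

```python
import numpy as np

def euler_solve(velocity_fn, x0, cond, num_steps):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

def make_toy_velocity(clean_mel):
    """Hypothetical stand-in for a learned estimator: exact field toward a known target."""
    def velocity(x, t, cond):
        # A real estimator would use `cond` (the codec-degraded mel-spectrogram);
        # this toy ignores it and uses the clean target directly.
        return (clean_mel - x) / (1.0 - t)
    return velocity

rng = np.random.default_rng(0)
clean = rng.normal(size=(80, 25))                       # assumed shape: 80 mel bins, 25 frames
degraded = clean + 0.5 * rng.normal(size=clean.shape)   # conditioning input
noise = rng.normal(size=clean.shape)                    # assumed starting point

refined = euler_solve(make_toy_velocity(clean), noise, degraded, num_steps=4)
max_err = float(np.max(np.abs(refined - clean)))
print(max_err)
```

With this idealized field, even four steps land on the target up to rounding error. A learned field is only approximate, which is why the number of solver steps trades quality against compute.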

If this is right

  • Speech remains intelligible and speaker-consistent at bitrates well below those of conventional codecs.
  • The refinement stage improves both quality metrics and speaker similarity without large added cost.
  • Model size and inference compute stay lower than competing neural codecs at the same bitrate.
  • The same pipeline works for both 16 kHz and 48 kHz sampling rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-consistency training that allows fewer flow-matching steps could lower latency for real-time applications.
  • If the clustering trick generalizes, similar single-codebook designs might help other extreme-compression generative tasks.
  • The mel-spectrogram domain plus refinement might transfer to coding music or environmental audio at comparable bitrates.

Load-bearing premise

The online clustering strategy prevents codebook collapse and keeps the 1024-entry codebook diverse enough for usable reconstruction at the 640x compression ratio.
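
A minimal sketch of the general idea follows: track how often each codeword is selected and move rarely used codewords onto current encoder outputs. The update rule, the decay and threshold values, and all names are assumptions chosen for illustration. The paper's actual reassignment rule is not described in this summary and may differ.

```python
import numpy as np

def assign(features, codebook):
    """Index of the nearest codeword (squared Euclidean distance) per feature."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def online_cluster_step(features, codebook, usage, rng, decay=0.5, min_usage=0.05):
    """One update: refresh running usage, then re-seed underused codewords."""
    counts = np.bincount(assign(features, codebook), minlength=len(codebook))
    usage = decay * usage + (1.0 - decay) * counts / len(features)
    dead = np.flatnonzero(usage < min_usage)
    codebook = codebook.copy()
    if dead.size:
        # Move each underused codeword onto a randomly chosen encoder output.
        picks = rng.choice(len(features), size=dead.size, replace=False)
        codebook[dead] = features[picks]
        usage[dead] = 1.0 / len(codebook)  # grace period for the fresh codes
    return codebook, usage

rng = np.random.default_rng(0)
features = rng.normal(size=(256, 2))        # toy "encoder outputs"
codebook = 100.0 + rng.normal(size=(8, 2))  # badly initialized, far from the data
usage = np.full(8, 1.0 / 8)

used_before = len(np.unique(assign(features, codebook)))
for _ in range(3):
    codebook, usage = online_cluster_step(features, codebook, usage, rng)
used_after = len(np.unique(assign(features, codebook)))
print(used_before, used_after)
```

In this toy setup the poorly initialized codebook starts with almost all features mapped to very few codewords, and the reassignment step spreads them across more entries. Whether the same holds at 1024 entries and 640x compression is exactly what the premise above takes for granted.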

What would settle it

Subjective listening tests at 250 bps in which FMelCodec receives equal or lower mean opinion scores and speaker similarity ratings than the strongest baseline codec.

Figures

Figures reproduced from arXiv: 2605.25669 by Hui-Peng Du, Xiao-Hang Jiang, Yang Ai, Yuan Tian, Zhen-Hua Ling.

Figure 1: Inference pipeline of the proposed FMelCodec under the CRR framework.
Figure 2: Architecture of the ConvNeXt v2–based mel-spectrogram encoder and decoder used in the mel-spectrogram coding stage
Figure 4: Architecture of the velocity field estimation network
Figure 5: Spectrogram visualizations of natural speech, speech decoded by FMelCodec and the baseline methods, as well as speech reconstructed via HiFi-GAN
Figure 6: ABX preference results on the 16 kHz LibriTTS test set, comparing
Figure 7: Visualization of the natural mel-spectrogram f
Figure 8: ABX preference results on the 16 kHz LibriTTS test set, comparing FMelCodec at 250 bps with its ablated variants. “N/P” denotes no preference and p indicates the p-value of a paired t-test used for statistical analysis.
Original abstract

Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, we propose FMelCodec, an ultra-low-bitrate neural speech codec in the mel-spectrogram domain, cast as a three-stage coding-refinement-reconstruction (CRR) framework that can operate at as low as 250 bps. In the CRR framework, the front-end mel-spectrogram coding stage employs a highly aggressive 640x compression/decompression encoder-decoder structure with a single 1024-entry VQ codebook, coupled with an online clustering strategy that reassigns underused codewords to prevent codebook collapse and preserve codebook diversity. The subsequent conditional flow matching (CFM)-based mel-spectrogram refinement stage leverages a lightweight velocity-field estimator and CFM-based solver to refine the codec-degraded mel-spectrogram produced by the preceding decoder, and adopts a self-consistency training scheme that supports fewer iterative inference steps for the purpose of reducing computational overhead. Finally, the vocoding-driven waveform reconstruction stage employs a HiFi-GAN vocoder to faithfully reconstruct waveform from the refined mel-spectrogram. Experiments conducted on two datasets spanning two sampling rates show that, under ultra-low-bitrate constraints of 250 bps for 16 kHz and 750 bps for 48 kHz, both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity, while incurring lower computational and model complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes FMelCodec, a three-stage CRR neural speech codec operating at 250 bps (16 kHz) and 750 bps (48 kHz). The front-end uses a 640x-compressed mel-spectrogram encoder-decoder with a single 1024-entry VQ codebook plus online clustering to avoid collapse; this is followed by a lightweight conditional flow-matching refinement stage with self-consistency training and a final HiFi-GAN vocoder. The central claim is that the system yields higher objective and subjective reconstruction quality and speaker similarity than prior methods while using lower model and computational complexity, validated on two datasets at the two sampling rates.

Significance. If the performance claims and the effectiveness of the online clustering at 640x compression are substantiated, the work would provide a concrete demonstration that aggressive single-codebook VQ plus flow-matching refinement can sustain usable naturalness and identity at bitrates previously considered marginal, with potential relevance to bandwidth-constrained speech transmission.

major comments (2)
  1. [front-end mel-spectrogram coding stage] Abstract and front-end mel-spectrogram coding stage: the assertion that the online clustering strategy 'prevents codebook collapse and preserve[s] codebook diversity' at the 640x compression ratio (25 codes/sec from a 1024-entry codebook) is presented without any supporting measurements (codeword utilization histograms, entropy statistics, or ablation removing the reassignment). This premise is load-bearing for the 250 bps claim, because collapse would render the subsequent CFM and HiFi-GAN stages unable to recover the reported quality.
  2. [Abstract] Abstract: the claim that 'both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity' is stated without any numerical values, baseline names, dataset sizes, or significance tests. The absence of these data prevents independent assessment of whether the reported gains are real or merely consistent with the weakest-assumption premise above.
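
The measurements requested in the first comment are cheap to compute from the stream of transmitted code indices. The sketch below shows one plausible way to obtain codeword utilization and empirical entropy. The function name and the toy index sequence are illustrative and do not come from the manuscript.

```python
import numpy as np

def codebook_stats(indices, codebook_size):
    """Return (fraction of codewords used, empirical entropy in bits per code)."""
    counts = np.bincount(indices, minlength=codebook_size)
    probs = counts / counts.sum()
    used = probs > 0
    utilization = float(used.mean())
    entropy_bits = float(-(probs[used] * np.log2(probs[used])).sum())
    return utilization, entropy_bits

# Toy stream: 100 codes drawn evenly from only 4 of 8 available codewords.
indices = np.tile(np.arange(4), 25)
utilization, entropy_bits = codebook_stats(indices, codebook_size=8)
print(utilization, entropy_bits)  # 0.5 2.0
```

For a 1024-entry codebook the entropy is bounded above by 10 bits per code, so a healthy codebook would show utilization near 1.0 and entropy close to that bound, while collapse would show up as both values dropping.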

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments below and will make the necessary revisions to strengthen the presentation of our claims.

point-by-point responses
  1. Referee: [front-end mel-spectrogram coding stage] Abstract and front-end mel-spectrogram coding stage: the assertion that the online clustering strategy 'prevents codebook collapse and preserve[s] codebook diversity' at the 640x compression ratio (25 codes/sec from a 1024-entry codebook) is presented without any supporting measurements (codeword utilization histograms, entropy statistics, or ablation removing the reassignment). This premise is load-bearing for the 250 bps claim, because collapse would render the subsequent CFM and HiFi-GAN stages unable to recover the reported quality.

    Authors: We agree that the manuscript as submitted does not include explicit supporting measurements (such as codeword utilization histograms, entropy statistics, or an ablation removing the reassignment) for the online clustering strategy. This is a valid concern given the aggressive compression ratio. In the revised manuscript we will add these analyses in the front-end coding stage section to substantiate the claim that the strategy prevents collapse and maintains diversity.
    revision: yes

  2. Referee: [Abstract] Abstract: the claim that 'both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity' is stated without any numerical values, baseline names, dataset sizes, or significance tests. The absence of these data prevents independent assessment of whether the reported gains are real or merely consistent with the weakest-assumption premise above.

    Authors: The abstract is written to remain concise while summarizing the key contributions and outcomes. We acknowledge that the current wording lacks specific numerical values, baseline names, and dataset details. In the revision we will incorporate a small number of representative objective metrics and baseline references into the abstract (subject to length constraints) and will explicitly direct readers to the experimental section for full tables, dataset sizes, and any statistical significance tests performed.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluations

full rationale

The paper describes a three-stage CRR framework (mel-spectrogram VQ coding with online clustering, CFM refinement, HiFi-GAN reconstruction) and supports its performance claims solely through objective/subjective experiments on two datasets at 250/750 bps. No equations, derivations, or self-citations are presented that reduce any reported gain to a fitted parameter, self-definition, or prior author result by construction. The online clustering strategy is introduced as an engineering choice whose success is asserted via experimental outcomes rather than mathematical equivalence to the input data or model. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on several design choices and domain assumptions extracted from the abstract; no new physical entities are introduced.

free parameters (2)
  • VQ codebook size
    Single 1024-entry codebook chosen for the coding stage to achieve 640x compression at target bitrates.
  • Compression ratio
    Highly aggressive 640x encoder-decoder compression ratio is a fixed architectural choice.
axioms (2)
  • domain assumption: Conditional flow matching with self-consistency training can refine codec-degraded mel-spectrograms while supporting reduced inference steps
    Invoked to justify the refinement stage and its computational benefit.
  • domain assumption: Online clustering prevents codebook collapse under extreme quantization
    Stated as the mechanism preserving codebook diversity in the coding stage.


