pith · machine review for the scientific record

arxiv: 2603.05373 · v2 · submitted 2026-03-05 · 💻 cs.SD · eess.AS

Recognition: 1 theorem link · Lean Theorem

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:03 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords discrete speech synthesis · spoof detection · hierarchical decoding · neural codec models · zero-shot synthesis · training-free inference · multi-resolution detection · token artifacts

The pith

MSpoof-TTS uses multi-resolution spoof detection to guide hierarchical decoding for improved zero-shot discrete speech synthesis without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural codec language models for speech synthesis often produce artifacts from token-level inconsistencies during inference. The paper introduces MSpoof-TTS, a training-free framework that applies multi-resolution token-based spoof detectors to evaluate candidate sequences at different temporal scales. These detectors flag locally unnatural patterns and feed into hierarchical decoding, pruning poor candidates and re-ranking the survivors. This guidance improves perceptual quality and robustness while leaving the base model unchanged. A sympathetic reader would care because it offers a practical way to enhance existing synthesis systems purely at inference time.
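The multi-resolution idea can be sketched concretely. The snippet below is a hedged toy illustration, not the paper's implementation: the detector is an invented repeat-run heuristic standing in for the paper's trained Conformer-based detectors, and the aggregation rule (minimum over segments, mean over resolutions) is one plausible choice.

```python
# Hedged sketch of multi-resolution spoof scoring over a codec token
# sequence. Detectors and segment lengths here are hypothetical stand-ins
# for the paper's trained per-resolution detectors.
from typing import List, Sequence


def segment(tokens: Sequence[int], length: int) -> List[Sequence[int]]:
    """Split a token sequence into non-overlapping windows."""
    return [tokens[i:i + length] for i in range(0, len(tokens), length)]


def multi_resolution_score(tokens: Sequence[int], detectors: dict) -> float:
    """Aggregate authenticity scores across temporal resolutions.

    `detectors` maps a segment length (None = full utterance) to a callable
    returning an authenticity score in [0, 1]. Taking the minimum over
    segments lets a single locally unnatural window drag the whole
    resolution's score down (one plausible aggregation; the paper may differ).
    """
    scores = []
    for length, detect in detectors.items():
        segs = segment(tokens, length) if length else [tokens]
        scores.append(min(detect(s) for s in segs))
    return sum(scores) / len(scores)


def toy_detector(seg: Sequence[int]) -> float:
    """Toy detector: flags long runs of a repeated token as unnatural."""
    longest = run = 1
    for a, b in zip(seg, seg[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return 1.0 / longest


detectors = {None: toy_detector, 50: toy_detector, 25: toy_detector}
clean = list(range(100))                                # no repeats
glitched = clean[:40] + [7] * 20 + clean[60:]           # a 20-token stutter
assert multi_resolution_score(clean, detectors) > multi_resolution_score(glitched, detectors)
```

The stutter is penalized at every resolution, so the glitched sequence scores well below the clean one; a real detector would learn far subtler token-level cues.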

Core claim

We propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters.

What carries the argument

Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at varying temporal granularities and integrates with hierarchical decoding to prune and re-rank hypotheses.

If this is right

  • Enhances robustness to token-level artifacts in discrete speech synthesis.
  • Improves perceptual realism in zero-shot generation without model retraining.
  • Progressively prunes low-quality candidates during decoding.
  • Re-ranks hypotheses using multi-resolution discriminator scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might generalize to other autoregressive token models in audio or language domains.
  • Combining multi-scale detection with existing alignment techniques could further reduce artifacts.
  • Testable extension: apply the same detectors to music generation tasks using similar codecs.

Load-bearing premise

The multi-resolution spoof detectors can reliably spot unnatural token patterns at different scales without eliminating too many valid high-quality synthesis options.
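This premise is directly measurable: score golden (real) token sequences with a detector and count how many fall below the pruning threshold. A hedged sketch, with invented scores and threshold standing in for real detector outputs:

```python
# Hedged sketch: estimate a detector's false-positive rate on golden (real)
# token sequences, i.e. the fraction of valid material it would wrongly
# prune at a given authenticity threshold. Data are toy placeholders.

def false_positive_rate(golden_scores, threshold):
    """Fraction of real sequences scored below the pruning threshold."""
    flagged = sum(s < threshold for s in golden_scores)
    return flagged / len(golden_scores)


# Hypothetical authenticity scores for 8 golden utterances.
golden = [0.91, 0.84, 0.88, 0.42, 0.95, 0.79, 0.90, 0.61]
assert false_positive_rate(golden, 0.5) == 1 / 8
assert false_positive_rate(golden, 0.7) == 2 / 8
```

If this rate is high at the threshold actually used for pruning, the detectors are eliminating valid high-quality synthesis options, and the premise fails.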

What would settle it

Conducting perceptual listening tests (e.g., MOS ratings of naturalness and quality) or computing objective proxies such as predicted MOS on samples generated with and without the MSpoof-TTS framework, to check whether quality improves or stays the same.
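The comparison described above reduces to a paired analysis. A sketch with entirely invented MOS values, only to show the shape of the test; in practice the scores would come from listening tests or a predictor such as NISQA or MOSNet:

```python
# Sketch of the settling experiment: paired scores for the same texts
# synthesized with and without MSpoof-TTS guidance, compared with a paired
# t-statistic. All numbers below are invented placeholders.
from math import sqrt
from statistics import mean, stdev

baseline = [3.6, 3.8, 3.5, 3.9, 3.7, 3.4]  # hypothetical MOS, no guidance
guided = [3.9, 4.0, 3.8, 4.1, 3.9, 3.7]    # hypothetical MOS, with guidance

diffs = [g - b for g, b in zip(guided, baseline)]
t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(f"mean improvement {mean(diffs):.2f} MOS, paired t = {t_stat:.2f}")
```

Pairing on the same texts removes item variance, so even a modest listening panel can detect whether the guided decode reliably outscores the baseline.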

Figures

Figures reproduced from arXiv: 2603.05373 by Junchuan Zhao, Minh Duc Vu, Ye Wang.

Figure 1
Figure 1: Overview of the multi-resolution token-based spoof detection framework. (a) Construction of token sequences at multiple temporal resolutions for training separate real/fake detectors. (b) Conformer-based discrete token spoof detector architecture. In-figure panels contrast golden and synthetic tokens at full utterance, segment length 50, and segment length 25.
Figure 2
Figure 2: t-SNE visualization of embedding distributions under different segment lengths.
Figure 3
Figure 3: Subjective evaluation of different inference strategies measured by MOS-N (naturalness), MOS-Q (quality), and SMOS (similarity).
Original abstract

Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and high-quality codec-based speech generation. Audio samples are available at https://danny-nus.github.io/MSpoofTTS.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MSpoof-TTS, a training-free inference framework for improving zero-shot discrete speech synthesis using neural codec language models. It introduces a Multi-Resolution Token-based Spoof Detection framework that evaluates codec token sequences at multiple temporal granularities to detect locally inconsistent or unnatural patterns. These detectors are integrated into a hierarchical decoding strategy that progressively prunes low-quality candidates and re-ranks hypotheses, with the goal of enhancing perceptual robustness without modifying model parameters or requiring retraining. The abstract states that experiments validate the framework's effectiveness for robust, high-quality codec-based speech generation, with audio samples linked.

Significance. If the multi-resolution detectors reliably flag artifacts while preserving diverse but valid token sequences, the approach would offer a practical, parameter-free method to mitigate token-level artifacts in discrete speech synthesis. This could be useful for zero-shot settings where retraining or preference optimization is costly. The training-free nature is a potential strength, but the current lack of quantitative support makes it difficult to determine whether the result would meaningfully advance the field beyond existing hierarchical or guided decoding techniques.

major comments (2)
  1. [Abstract] The claim that 'Experiments validate the effectiveness of our framework' is unsupported by any quantitative results, baseline comparisons, ablation studies, detector precision/recall metrics, or pruning ratios, which directly undermines the central empirical assertion of improved robustness.
  2. [Multi-Resolution Token-based Spoof Detection framework] In the detection framework and hierarchical decoding description, no analysis is provided of the detectors' false-positive rates on natural prosodic or speaker-specific variation, nor of the fraction of hypotheses pruned at each resolution; without this, it is impossible to confirm that the strategy yields a net gain rather than discarding high-quality paths.
minor comments (1)
  1. [Abstract] The abstract mentions audio samples at a GitHub link but provides no description of what specific improvements (e.g., reduced artifacts in particular conditions) listeners should expect to hear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will incorporate quantitative analyses and detector evaluations in the revised manuscript to strengthen the empirical claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'Experiments validate the effectiveness of our framework' is unsupported by any quantitative results, baseline comparisons, ablation studies, detector precision/recall metrics, or pruning ratios, which directly undermines the central empirical assertion of improved robustness.

    Authors: We agree that the current version relies on qualitative audio demonstrations rather than quantitative metrics, which weakens the abstract's claim. In revision we will add objective evaluations including MOS listening tests, comparisons to standard hierarchical decoding and other baselines, ablation studies on each resolution level, detector precision/recall on held-out spoof and natural data, and pruning-ratio statistics at every stage of the hierarchy. revision: yes

  2. Referee: [Multi-Resolution Token-based Spoof Detection framework] In the detection framework and hierarchical decoding description, no analysis is provided of the detectors' false-positive rates on natural prosodic or speaker-specific variation, nor of the fraction of hypotheses pruned at each resolution; without this, it is impossible to confirm that the strategy yields a net gain rather than discarding high-quality paths.

    Authors: We acknowledge the need for explicit false-positive and pruning analysis. The detectors target local token-level inconsistencies rather than prosody, yet we will add new experiments in revision that measure false-positive rates on natural speech with varied prosody and speakers, report per-resolution pruning fractions, and correlate these with perceptual quality gains to demonstrate net benefit. revision: yes
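The per-resolution pruning statistics the rebuttal promises amount to a simple tally. A hedged sketch with invented scores (the resolution names and threshold are placeholders, not values from the paper):

```python
# Sketch of per-resolution pruning statistics: for each detector
# resolution, the fraction of candidate hypotheses falling below the
# pruning threshold. All scores below are invented placeholders.

def pruning_fractions(scores_by_resolution, threshold=0.5):
    """Map each resolution to the fraction of candidates it would prune."""
    return {
        res: sum(s < threshold for s in scores) / len(scores)
        for res, scores in scores_by_resolution.items()
    }


# Hypothetical authenticity scores for 4 candidates at 3 resolutions.
scores = {
    "full": [0.9, 0.7, 0.3, 0.8],
    "seg50": [0.8, 0.4, 0.2, 0.9],
    "seg25": [0.6, 0.3, 0.1, 0.7],
}
fracs = pruning_fractions(scores)
assert fracs == {"full": 0.25, "seg50": 0.5, "seg25": 0.5}
```

Reported alongside perceptual quality, these fractions would show whether finer resolutions prune aggressively enough to matter without discarding good paths.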

Circularity Check

0 steps flagged

No circularity: new framework components and experimental validation are independent of inputs

Full rationale

The paper introduces MSpoof-TTS as a training-free inference method with a newly defined Multi-Resolution Token-based Spoof Detection framework and hierarchical decoding strategy. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction. The derivation chain consists of proposing detectors at multiple granularities, integrating them for pruning, and validating via experiments, all of which are externally falsifiable and not self-referential. This matches the default expectation of a non-circular proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that newly introduced multi-resolution spoof detectors can be built and will function as intended for pruning; no free parameters or external benchmarks are mentioned in the abstract.

axioms (1)
  • domain assumption Multi-resolution evaluation of codec token sequences can detect locally inconsistent or unnatural patterns
    Invoked when the paper states the detectors evaluate sequences at different temporal granularities to identify artifacts.
invented entities (1)
  • Multi-Resolution Token-based Spoof Detection framework no independent evidence
    purpose: To evaluate codec sequences at different temporal granularities and detect inconsistent patterns for guiding decoding
    Newly proposed component integrated into the hierarchical strategy

pith-pipeline@v0.9.0 · 5437 in / 1273 out tokens · 74019 ms · 2026-05-15T15:03:59.069404+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

  1. [1]

    Introduction: Neural codec language models have recently become a practical and effective approach for zero-shot speech synthesis [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. By modeling speech as sequences of discrete codec tokens with autoregressive or transformer architectures, these systems streamline the synthesis pipeline and naturally adopt scalable decodin...

  2. [2]

    Overview: Pretrained codec-based language models have demonstrated strong capability in zero-shot speech synthesis by modeling discrete codec token sequences autoregressively

    Method 2.1. Overview: Pretrained codec-based language models have demonstrated strong capability in zero-shot speech synthesis by modeling discrete codec token sequences autoregressively. In this work, we adopt NeuTTS1, a pretrained codec-based TTS system, as the base generator, whose parameters remain fixed throughout our framework. While such models a...

  3. [3]

    Datasets: For spoof detection training, we use the LibriTTS [35] training split (approximately 100 hours of clean read English speech)

    Experiments 3.1. Datasets: For spoof detection training, we use the LibriTTS [35] training split (approximately 100 hours of clean read English speech). LibriTTS is a multi-speaker corpus with careful segmentation and noise filtering. For each ground-truth utterance, we generate three synthetic counterparts using the same transcript but a different refer...

  4. [4]

    Results 4.1. Evaluation on Standard Benchmarks: We compare several inference strategies, including the default top-k sampling decoder (Original), repetition-aware sampling (RAS), entropy-aware sampling (EAS), and their hierarchical extensions (HierRAS and HierEAS). HierRAS and HierEAS integrate the proposed hierarchical spoof-guided sampling framework ...

  5. [5]

    We use multi-resolution spoof detectors to guide decoding and suppress locally inconsistent codec token patterns, without modifying the pretrained speech language model

    Conclusion: We present MSpoofTTS, a training-free framework that improves discrete speech synthesis through hierarchical spoof-guided inference. We use multi-resolution spoof detectors to guide decoding and suppress locally inconsistent codec token patterns, without modifying the pretrained speech language model. Experiments on LibriTTS and LibriSpeech ...

  6. [6]

    Naturalspeech: End-to-end text-to-speech synthesis with human-level quality

    X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He et al., "Naturalspeech: End-to-end text-to-speech synthesis with human-level quality," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4234–4245, 2024

  7. [7]

    Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

    Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang et al., "Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models," in International Conference on Machine Learning, ICML 2024, vol. 235, 2024, pp. 22605–22623

  8. [8]

    CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens

    Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., "CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens," arXiv preprint arXiv:2407.05407, 2024

  9. [9]

    Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis

    Z. Ye, X. Zhu, C.-M. Chan, X. Wang, X. Tan, J. Lei, Y. Peng, H. Liu, Y. Jin, Z. Dai et al., "Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis," arXiv preprint arXiv:2502.04128, 2025

  10. [10]

    Autoregressive speech synthesis without vector quantization

    L. Meng, L. Zhou, S. Liu, S. Chen, B. Han, S. Hu, Y. Liu, J. Li, S. Zhao, X. Wu et al., "Autoregressive speech synthesis without vector quantization," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 1287–1300

  11. [11]

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer

    Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, "Maskgct: Zero-shot text-to-speech with masked generative codec transformer," in The Thirteenth International Conference on Learning Representations, 2025

  12. [12]

    Ella-v: Stable neural codec language modeling with alignment-guided sequence reordering

    Y. Song, Z. Chen, X. Wang, Z. Ma, and X. Chen, "Ella-v: Stable neural codec language modeling with alignment-guided sequence reordering," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 24, 2025, pp. 25174–25182

  13. [13]

    F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, "F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271

  14. [14]

    Prosody-adaptable audio codecs for zero-shot voice conversion via in-context learning

    J. Zhao, X. Wang, and Y. Wang, "Prosody-adaptable audio codecs for zero-shot voice conversion via in-context learning," in Interspeech 2025, 2025, pp. 4893–4897

  15. [15]

    Segment-aware conditioning for training-free intra-utterance emotion and duration control in text-to-speech

    Q. Liang, Y. Liu, R. Wei, N. Lu, J. Zhao, and Y. Wang, "Segment-aware conditioning for training-free intra-utterance emotion and duration control in text-to-speech," arXiv preprint arXiv:2601.03170, 2026

  16. [16]

    Speechalign: Aligning speech generation to human preferences

    D. Zhang, Z. Li, S. Li, X. Zhang, P. Wang, Y. Zhou, and X. Qiu, "Speechalign: Aligning speech generation to human preferences," Advances in Neural Information Processing Systems, vol. 37, pp. 50343–50360, 2024

  17. [17]

    Robust zero-shot text-to-speech synthesis with reverse inference optimization

    Y. Hu, C. Chen, S. Wang, E. S. Chng, and C. Zhang, "Robust zero-shot text-to-speech synthesis with reverse inference optimization," arXiv preprint arXiv:2407.02243, 2024

  18. [18]

    Differentiable reward optimization for llm based tts system

    C. Gao, Z. Du, and S. Zhang, "Differentiable reward optimization for llm based tts system," in Interspeech 2025, 2025, pp. 2450–2454

  19. [19]

    Enhancing zero-shot text-to-speech synthesis with human feedback

    C. Chen, Y. Hu, W. Wu, H. Wang, E. S. Chng, and C. Zhang, "Enhancing zero-shot text-to-speech synthesis with human feedback," arXiv preprint arXiv:2406.00654, 2024

  20. [20]

    Comelsinger: Discrete token-based zero-shot singing synthesis with structured melody control and guidance

    J. Zhao, W. Zeng, T. Lyu, and Y. Wang, "Comelsinger: Discrete token-based zero-shot singing synthesis with structured melody control and guidance," IEEE Transactions on Audio, Speech and Language Processing, 2026

  21. [21]

    Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers

    S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, "Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers," arXiv preprint arXiv:2406.05370, 2024

  22. [22]

    Vall-e r: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment

    B. Han, L. Zhou, S. Liu, S. Chen, L. Meng, Y. Qian, Y. Liu, S. Zhao, J. Li, and F. Wei, "Vall-e r: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment," arXiv preprint arXiv:2406.07855, 2024

  23. [23]

    Dexperts: Decoding-time controlled text generation with experts and anti-experts

    A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi, "Dexperts: Decoding-time controlled text generation with experts and anti-experts," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)...

  24. [24]

    Plug and play language models: A simple approach to controlled text generation

    S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu, "Plug and play language models: A simple approach to controlled text generation," in The Eighth International Conference on Learning Representations, 2020

  25. [25]

    Toward a universal synthetic speech spoofing detection using phase information

    J. Sanchez, I. Saratxaga, I. Hernaez, E. Navas, D. Erro, and T. Raitio, "Toward a universal synthetic speech spoofing detection using phase information," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 810–820, 2015

  26. [26]

    One-class learning towards synthetic voice spoofing detection

    Y. Zhang, F. Jiang, and Z. Duan, "One-class learning towards synthetic voice spoofing detection," IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021

  27. [27]

    Generalizable speech spoofing detection against silence trimming with data augmentation and multi-task meta-learning

    L. Wang, L. Yu, Y. Zhang, and H. Xie, "Generalizable speech spoofing detection against silence trimming with data augmentation and multi-task meta-learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3296–3310, 2024

  28. [28]

    Contrastive learning-based speech spoofing detection for multimedia security in edge intelligence

    J. Sun, X. Deng, S. Liu, X. Fan, Y. Huang, Y. He, C. Wu, and J. Park, "Contrastive learning-based speech spoofing detection for multimedia security in edge intelligence," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 21, no. 8, pp. 1–21, 2025

  29. [29]

    Voice spoofing detector: A unified anti-spoofing framework

    A. Javed, K. M. Malik, H. Malik, and A. Irtaza, "Voice spoofing detector: A unified anti-spoofing framework," Expert Systems with Applications, vol. 198, p. 116770, 2022

  30. [30]

    Asvspoof: The automatic speaker verification spoofing and countermeasures challenge

    Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, "Asvspoof: The automatic speaker verification spoofing and countermeasures challenge," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, 2017

  31. [31]

    Asvspoof 2019: Future horizons in spoofed and fake audio detection

    M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "Asvspoof 2019: Future horizons in spoofed and fake audio detection," in Interspeech 2019, 2019, pp. 1008–1012

  32. [32]

    Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild

    X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, "Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 31, pp. 2507–2522, 2023

  33. [33]

    ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale

    X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. H. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 1–8

  34. [34]

    Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems

    H. Wu, Y. Tseng, and H.-y. Lee, "Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems," in Interspeech 2024, 2024, pp. 1770–1774

  35. [35]

    Codecfake+: A large-scale neural audio codec-based deepfake speech dataset

    X. Chen, J. Du, H. Wu, L. Zhang, I. Lin, I. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang et al., "Codecfake+: A large-scale neural audio codec-based deepfake speech dataset," arXiv preprint arXiv:2501.08238, 2025

  36. [36]

    Codecfake-omni: A large-scale codec-based deepfake speech dataset

    J. Du, X. Chen, H. Wu, L. Zhang, I. Lin, I. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang et al., "Codecfake-omni: A large-scale codec-based deepfake speech dataset," arXiv e-prints, pp. arXiv–2501, 2025

  37. [37]

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

    J. Kong, J. Kim, and J. Bae, "Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020

  38. [38]

    Bigvgan: A universal neural vocoder with large-scale training

    S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, "Bigvgan: A universal neural vocoder with large-scale training," in The Eleventh International Conference on Learning Representations, 2023

  39. [39]

    Conformer: Convolution-augmented transformer for speech recognition

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., "Conformer: Convolution-augmented transformer for speech recognition," in Interspeech 2020, 2020, pp. 5036–5040

  40. [40]

    Libritts: A corpus derived from librispeech for text-to-speech

    H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "Libritts: A corpus derived from librispeech for text-to-speech," in Interspeech 2019, 2019, pp. 1526–1530

  41. [41]

    Librispeech: An ASR corpus based on public domain audio books

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

  42. [42]

    TwistList: Resources and baselines for tongue twister generation

    T. Loakman, C. Tang, and C. Lin, "TwistList: Resources and baselines for tongue twister generation," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, pp. 579–589

  43. [43]

    Robust speech recognition via large-scale weak supervision

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning, ICML 2023, 2023, pp. 28492–28518

  44. [44]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing

    S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., "Wavlm: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  45. [45]

    Nisqa: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

    G. Mittag, B. Naderi, A. Chehadi, and S. Möller, "Nisqa: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets," in Interspeech 2021, 2021, pp. 2127–2131

  46. [46]

    Mosnet: Deep learning-based objective assessment for voice conversion

    C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, "Mosnet: Deep learning-based objective assessment for voice conversion," Interspeech 2019, 2019