pith · machine review for the scientific record

arxiv: 2603.05373 · v2 · submitted 2026-03-05 · 💻 cs.SD · eess.AS

Recognition: 1 theorem link · Lean Theorem

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:03 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords discrete speech synthesis · spoof detection · hierarchical decoding · neural codec models · zero-shot synthesis · training-free inference · multi-resolution detection · token artifacts

The pith

MSpoof-TTS uses multi-resolution spoof detection to guide hierarchical decoding for improved zero-shot discrete speech synthesis without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural codec language models for speech synthesis often produce artifacts from token-level inconsistencies during inference. The paper introduces MSpoof-TTS, a training-free framework that applies multi-resolution token-based spoof detectors to evaluate candidate sequences at different temporal scales. These detectors flag locally unnatural patterns and feed into hierarchical decoding, pruning poor candidates and re-ranking the survivors. This guidance improves perceptual quality and robustness while leaving the base model unchanged. A sympathetic reader would care because it offers a practical way to enhance existing synthesis systems purely at inference time.
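The multi-resolution idea can be sketched concretely. The snippet below is a hedged toy illustration, not the paper's implementation: the detector is an invented repeat-run heuristic standing in for the paper's trained Conformer-based detectors, and the aggregation rule (minimum over segments, mean over resolutions) is one plausible choice.

```python
# Hedged sketch of multi-resolution spoof scoring over a codec token
# sequence. Detectors and segment lengths here are hypothetical stand-ins
# for the paper's trained per-resolution detectors.
from typing import List, Sequence


def segment(tokens: Sequence[int], length: int) -> List[Sequence[int]]:
    """Split a token sequence into non-overlapping windows."""
    return [tokens[i:i + length] for i in range(0, len(tokens), length)]


def multi_resolution_score(tokens: Sequence[int], detectors: dict) -> float:
    """Aggregate authenticity scores across temporal resolutions.

    `detectors` maps a segment length (None = full utterance) to a callable
    returning an authenticity score in [0, 1]. Taking the minimum over
    segments lets a single locally unnatural window drag the whole
    resolution's score down (one plausible aggregation; the paper may differ).
    """
    scores = []
    for length, detect in detectors.items():
        segs = segment(tokens, length) if length else [tokens]
        scores.append(min(detect(s) for s in segs))
    return sum(scores) / len(scores)


def toy_detector(seg: Sequence[int]) -> float:
    """Toy detector: flags long runs of a repeated token as unnatural."""
    longest = run = 1
    for a, b in zip(seg, seg[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return 1.0 / longest


detectors = {None: toy_detector, 50: toy_detector, 25: toy_detector}
clean = list(range(100))                                # no repeats
glitched = clean[:40] + [7] * 20 + clean[60:]           # a 20-token stutter
assert multi_resolution_score(clean, detectors) > multi_resolution_score(glitched, detectors)
```

The stutter is penalized at every resolution, so the glitched sequence scores well below the clean one; a real detector would learn far subtler token-level cues.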

Core claim

We propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters.

What carries the argument

Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at varying temporal granularities and integrates with hierarchical decoding to prune and re-rank hypotheses.

If this is right

  • Enhances robustness to token-level artifacts in discrete speech synthesis.
  • Improves perceptual realism in zero-shot generation without model retraining.
  • Progressively prunes low-quality candidates during decoding.
  • Re-ranks hypotheses using multi-resolution discriminator scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might generalize to other autoregressive token models in audio or language domains.
  • Combining multi-scale detection with existing alignment techniques could further reduce artifacts.
  • Testable extension: apply the same detectors to music generation tasks using similar codecs.

Load-bearing premise

The multi-resolution spoof detectors can reliably spot unnatural token patterns at different scales without eliminating too many valid high-quality synthesis options.
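This premise is directly measurable: score golden (real) token sequences with a detector and count how many fall below the pruning threshold. A hedged sketch, with invented scores and threshold standing in for real detector outputs:

```python
# Hedged sketch: estimate a detector's false-positive rate on golden (real)
# token sequences, i.e. the fraction of valid material it would wrongly
# prune at a given authenticity threshold. Data are toy placeholders.

def false_positive_rate(golden_scores, threshold):
    """Fraction of real sequences scored below the pruning threshold."""
    flagged = sum(s < threshold for s in golden_scores)
    return flagged / len(golden_scores)


# Hypothetical authenticity scores for 8 golden utterances.
golden = [0.91, 0.84, 0.88, 0.42, 0.95, 0.79, 0.90, 0.61]
assert false_positive_rate(golden, 0.5) == 1 / 8
assert false_positive_rate(golden, 0.7) == 2 / 8
```

If this rate is high at the threshold actually used for pruning, the detectors are eliminating valid high-quality synthesis options, and the premise fails.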

What would settle it

Conducting perceptual listening tests (e.g., MOS ratings of naturalness and quality) or computing objective proxies such as predicted MOS on samples generated with and without the MSpoof-TTS framework, to check whether quality improves or stays the same.
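The comparison described above reduces to a paired analysis. A sketch with entirely invented MOS values, only to show the shape of the test; in practice the scores would come from listening tests or a predictor such as NISQA or MOSNet:

```python
# Sketch of the settling experiment: paired scores for the same texts
# synthesized with and without MSpoof-TTS guidance, compared with a paired
# t-statistic. All numbers below are invented placeholders.
from math import sqrt
from statistics import mean, stdev

baseline = [3.6, 3.8, 3.5, 3.9, 3.7, 3.4]  # hypothetical MOS, no guidance
guided = [3.9, 4.0, 3.8, 4.1, 3.9, 3.7]    # hypothetical MOS, with guidance

diffs = [g - b for g, b in zip(guided, baseline)]
t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(f"mean improvement {mean(diffs):.2f} MOS, paired t = {t_stat:.2f}")
```

Pairing on the same texts removes item variance, so even a modest listening panel can detect whether the guided decode reliably outscores the baseline.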

Figures

Figures reproduced from arXiv: 2603.05373 by Junchuan Zhao, Minh Duc Vu, Ye Wang.

Figure 1
Figure 1: Overview of the multi-resolution token-based spoof detection framework. (a) Construction of token sequences at multiple temporal resolutions for training separate real/fake detectors. (b) Conformer-based discrete token spoof detector architecture. In-figure panels contrast golden and synthetic tokens at full utterance, segment length 50, and segment length 25.
Figure 2
Figure 2: t-SNE visualization of embedding distributions under different segment lengths.
Figure 3
Figure 3: Subjective evaluation of different inference strategies measured by MOS-N (naturalness), MOS-Q (quality), and SMOS (similarity).
Original abstract

Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and high-quality codec-based speech generation. Audio samples are available at https://danny-nus.github.io/MSpoofTTS.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MSpoof-TTS, a training-free inference framework for improving zero-shot discrete speech synthesis using neural codec language models. It introduces a Multi-Resolution Token-based Spoof Detection framework that evaluates codec token sequences at multiple temporal granularities to detect locally inconsistent or unnatural patterns. These detectors are integrated into a hierarchical decoding strategy that progressively prunes low-quality candidates and re-ranks hypotheses, with the goal of enhancing perceptual robustness without modifying model parameters or requiring retraining. The abstract states that experiments validate the framework's effectiveness for robust, high-quality codec-based speech generation, with audio samples linked.

Significance. If the multi-resolution detectors reliably flag artifacts while preserving diverse but valid token sequences, the approach would offer a practical, parameter-free method to mitigate token-level artifacts in discrete speech synthesis. This could be useful for zero-shot settings where retraining or preference optimization is costly. The training-free nature is a potential strength, but the current lack of quantitative support makes it difficult to determine whether the result would meaningfully advance the field beyond existing hierarchical or guided decoding techniques.

major comments (2)
  1. [Abstract] The claim that 'Experiments validate the effectiveness of our framework' is unsupported by any quantitative results, baseline comparisons, ablation studies, detector precision/recall metrics, or pruning ratios, which directly undermines the central empirical assertion of improved robustness.
  2. [Multi-Resolution Token-based Spoof Detection framework] In the detection framework and hierarchical decoding description, no analysis is provided of the detectors' false-positive rates on natural prosodic or speaker-specific variation, nor of the fraction of hypotheses pruned at each resolution; without this, it is impossible to confirm that the strategy yields a net gain rather than discarding high-quality paths.
minor comments (1)
  1. [Abstract] The abstract mentions audio samples at a GitHub link but provides no description of what specific improvements (e.g., reduced artifacts in particular conditions) listeners should expect to hear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will incorporate quantitative analyses and detector evaluations in the revised manuscript to strengthen the empirical claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'Experiments validate the effectiveness of our framework' is unsupported by any quantitative results, baseline comparisons, ablation studies, detector precision/recall metrics, or pruning ratios, which directly undermines the central empirical assertion of improved robustness.

    Authors: We agree that the current version relies on qualitative audio demonstrations rather than quantitative metrics, which weakens the abstract's claim. In revision we will add objective evaluations including MOS listening tests, comparisons to standard hierarchical decoding and other baselines, ablation studies on each resolution level, detector precision/recall on held-out spoof and natural data, and pruning-ratio statistics at every stage of the hierarchy. revision: yes

  2. Referee: [Multi-Resolution Token-based Spoof Detection framework] In the detection framework and hierarchical decoding description, no analysis is provided of the detectors' false-positive rates on natural prosodic or speaker-specific variation, nor of the fraction of hypotheses pruned at each resolution; without this, it is impossible to confirm that the strategy yields a net gain rather than discarding high-quality paths.

    Authors: We acknowledge the need for explicit false-positive and pruning analysis. The detectors target local token-level inconsistencies rather than prosody, yet we will add new experiments in revision that measure false-positive rates on natural speech with varied prosody and speakers, report per-resolution pruning fractions, and correlate these with perceptual quality gains to demonstrate net benefit. revision: yes
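The per-resolution pruning statistics the rebuttal promises amount to a simple tally. A hedged sketch with invented scores (the resolution names and threshold are placeholders, not values from the paper):

```python
# Sketch of per-resolution pruning statistics: for each detector
# resolution, the fraction of candidate hypotheses falling below the
# pruning threshold. All scores below are invented placeholders.

def pruning_fractions(scores_by_resolution, threshold=0.5):
    """Map each resolution to the fraction of candidates it would prune."""
    return {
        res: sum(s < threshold for s in scores) / len(scores)
        for res, scores in scores_by_resolution.items()
    }


# Hypothetical authenticity scores for 4 candidates at 3 resolutions.
scores = {
    "full": [0.9, 0.7, 0.3, 0.8],
    "seg50": [0.8, 0.4, 0.2, 0.9],
    "seg25": [0.6, 0.3, 0.1, 0.7],
}
fracs = pruning_fractions(scores)
assert fracs == {"full": 0.25, "seg50": 0.5, "seg25": 0.5}
```

Reported alongside perceptual quality, these fractions would show whether finer resolutions prune aggressively enough to matter without discarding good paths.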

Circularity Check

0 steps flagged

No circularity: new framework components and experimental validation are independent of inputs

Full rationale

The paper introduces MSpoof-TTS as a training-free inference method with a newly defined Multi-Resolution Token-based Spoof Detection framework and hierarchical decoding strategy. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction. The derivation chain consists of proposing detectors at multiple granularities, integrating them for pruning, and validating via experiments, all of which are externally falsifiable and not self-referential. This matches the default expectation of a non-circular proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that newly introduced multi-resolution spoof detectors can be built and will function as intended for pruning; no free parameters or external benchmarks are mentioned in the abstract.

axioms (1)
  • domain assumption Multi-resolution evaluation of codec token sequences can detect locally inconsistent or unnatural patterns
    Invoked when the paper states the detectors evaluate sequences at different temporal granularities to identify artifacts.
invented entities (1)
  • Multi-Resolution Token-based Spoof Detection framework no independent evidence
    purpose: To evaluate codec sequences at different temporal granularities and detect inconsistent patterns for guiding decoding
    Newly proposed component integrated into the hierarchical strategy

pith-pipeline@v0.9.0 · 5437 in / 1273 out tokens · 74019 ms · 2026-05-15T15:03:59.069404+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

  1. [1]

    Introduction: Neural codec language models have recently become a practical and effective approach for zero-shot speech synthesis [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. By modeling speech as sequences of discrete codec tokens with autoregressive or transformer architectures, these systems streamline the synthesis pipeline and naturally adopt scalable decodin...

  2. [2]

    Overview: Pretrained codec-based language models have demonstrated strong capability in zero-shot speech synthesis by modeling discrete codec token sequences autoregressively

    Method 2.1. Overview: Pretrained codec-based language models have demonstrated strong capability in zero-shot speech synthesis by modeling discrete codec token sequences autoregressively. In this work, we adopt NeuTTS1, a pretrained codec-based TTS system, as the base generator, whose parameters remain fixed throughout our framework. While such models a...

  3. [3]

    Datasets: For spoof detection training, we use the LibriTTS [35] training split (approximately 100 hours of clean read English speech)

    Experiments 3.1. Datasets: For spoof detection training, we use the LibriTTS [35] training split (approximately 100 hours of clean read English speech). LibriTTS is a multi-speaker corpus with careful segmentation and noise filtering. For each ground-truth utterance, we generate three synthetic counterparts using the same transcript but a different refer...

  4. [4]

    Results 4.1. Evaluation on Standard Benchmarks: We compare several inference strategies, including the default top-k sampling decoder (Original), repetition-aware sampling (RAS), entropy-aware sampling (EAS), and their hierarchical extensions (HierRAS and HierEAS). HierRAS and HierEAS integrate the proposed hierarchical spoof-guided sampling framework ...

  5. [5]

    We use multi-resolution spoof detectors to guide decoding and suppress locally inconsistent codec token patterns, without modifying the pretrained speech language model

    Conclusion: We present MSpoofTTS, a training-free framework that improves discrete speech synthesis through hierarchical spoof-guided inference. We use multi-resolution spoof detectors to guide decoding and suppress locally inconsistent codec token patterns, without modifying the pretrained speech language model. Experiments on LibriTTS and LibriSpeech ...

  6. [6]

    Naturalspeech: End-to-end text-to-speech synthesis with human-level quality

    X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He et al., "Naturalspeech: End-to-end text-to-speech synthesis with human-level quality," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4234–4245, 2024

  7. [7]

    Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

    Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang et al., "Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models," in International Conference on Machine Learning, ICML 2024, vol. 235, 2024, pp. 22605–22623

  8. [8]

    CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens

    Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., "CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens," arXiv preprint arXiv:2407.05407, 2024

  9. [9]

    Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis

    Z. Ye, X. Zhu, C.-M. Chan, X. Wang, X. Tan, J. Lei, Y. Peng, H. Liu, Y. Jin, Z. Dai et al., "Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis," arXiv preprint arXiv:2502.04128, 2025

  10. [10]

    Autoregressive speech synthesis without vector quantization

    L. Meng, L. Zhou, S. Liu, S. Chen, B. Han, S. Hu, Y. Liu, J. Li, S. Zhao, X. Wu et al., "Autoregressive speech synthesis without vector quantization," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 1287–1300

  11. [11]

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer

    Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, "Maskgct: Zero-shot text-to-speech with masked generative codec transformer," in The Thirteenth International Conference on Learning Representations, 2025

  12. [12]

    Ella-v: Stable neural codec language modeling with alignment-guided sequence reordering

    Y. Song, Z. Chen, X. Wang, Z. Ma, and X. Chen, "Ella-v: Stable neural codec language modeling with alignment-guided sequence reordering," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 24, 2025, pp. 25174–25182

  13. [13]

    F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, "F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271

  14. [14]

    Prosody-adaptable audio codecs for zero-shot voice conversion via in-context learning

    J. Zhao, X. Wang, and Y. Wang, "Prosody-adaptable audio codecs for zero-shot voice conversion via in-context learning," in Interspeech 2025, 2025, pp. 4893–4897

  15. [15]

    Segment-aware conditioning for training-free intra-utterance emotion and duration control in text-to-speech

    Q. Liang, Y. Liu, R. Wei, N. Lu, J. Zhao, and Y. Wang, "Segment-aware conditioning for training-free intra-utterance emotion and duration control in text-to-speech," arXiv preprint arXiv:2601.03170, 2026

  16. [16]

    Speechalign: Aligning speech generation to human preferences

    D. Zhang, Z. Li, S. Li, X. Zhang, P. Wang, Y. Zhou, and X. Qiu, "Speechalign: Aligning speech generation to human preferences," Advances in Neural Information Processing Systems, vol. 37, pp. 50343–50360, 2024

  17. [17]

    Robust zero-shot text-to-speech synthesis with reverse inference optimization

    Y. Hu, C. Chen, S. Wang, E. S. Chng, and C. Zhang, "Robust zero-shot text-to-speech synthesis with reverse inference optimization," arXiv preprint arXiv:2407.02243, 2024

  18. [18]

    Differentiable reward optimization for llm based tts system

    C. Gao, Z. Du, and S. Zhang, "Differentiable reward optimization for llm based tts system," in Interspeech 2025, 2025, pp. 2450–2454

  19. [19]

    Enhancing zero-shot text-to-speech synthesis with human feedback

    C. Chen, Y. Hu, W. Wu, H. Wang, E. S. Chng, and C. Zhang, "Enhancing zero-shot text-to-speech synthesis with human feedback," arXiv preprint arXiv:2406.00654, 2024

  20. [20]

    Comelsinger: Discrete token-based zero-shot singing synthesis with structured melody control and guidance

    J. Zhao, W. Zeng, T. Lyu, and Y. Wang, "Comelsinger: Discrete token-based zero-shot singing synthesis with structured melody control and guidance," IEEE Transactions on Audio, Speech and Language Processing, 2026

  21. [21]

    Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers

    S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, "Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers," arXiv preprint arXiv:2406.05370, 2024

  22. [22]

    Vall-e r: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment

    B. Han, L. Zhou, S. Liu, S. Chen, L. Meng, Y. Qian, Y. Liu, S. Zhao, J. Li, and F. Wei, "Vall-e r: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment," arXiv preprint arXiv:2406.07855, 2024

  23. [23]

    Dexperts: Decoding-time controlled text generation with experts and anti-experts

    A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi, "Dexperts: Decoding-time controlled text generation with experts and anti-experts," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)...

  24. [24]

    Plug and play language models: A simple approach to controlled text generation

    S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu, "Plug and play language models: A simple approach to controlled text generation," in The Eighth International Conference on Learning Representations, 2020

  25. [25]

    Toward a universal synthetic speech spoofing detection using phase information

    J. Sanchez, I. Saratxaga, I. Hernaez, E. Navas, D. Erro, and T. Raitio, "Toward a universal synthetic speech spoofing detection using phase information," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 810–820, 2015

  26. [26]

    One-class learning towards synthetic voice spoofing detection

    Y. Zhang, F. Jiang, and Z. Duan, "One-class learning towards synthetic voice spoofing detection," IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021

  27. [27]

    Generalizable speech spoofing detection against silence trimming with data augmentation and multi-task meta-learning

    L. Wang, L. Yu, Y. Zhang, and H. Xie, "Generalizable speech spoofing detection against silence trimming with data augmentation and multi-task meta-learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3296–3310, 2024

  28. [28]

    Contrastive learning-based speech spoofing detection for multimedia security in edge intelligence

    J. Sun, X. Deng, S. Liu, X. Fan, Y. Huang, Y. He, C. Wu, and J. Park, "Contrastive learning-based speech spoofing detection for multimedia security in edge intelligence," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 21, no. 8, pp. 1–21, 2025

  29. [29]

    Voice spoofing detector: A unified anti-spoofing framework

    A. Javed, K. M. Malik, H. Malik, and A. Irtaza, "Voice spoofing detector: A unified anti-spoofing framework," Expert Systems with Applications, vol. 198, p. 116770, 2022

  30. [30]

    Asvspoof: The automatic speaker verification spoofing and countermeasures challenge

    Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, "Asvspoof: The automatic speaker verification spoofing and countermeasures challenge," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, 2017

  31. [31]

    Asvspoof 2019: Future horizons in spoofed and fake audio detection

    M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "Asvspoof 2019: Future horizons in spoofed and fake audio detection," in Interspeech 2019, 2019, pp. 1008–1012

  32. [32]

    Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild

    X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, "Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 31, pp. 2507–2522, 2023

  33. [33]

    ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale

    X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. H. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 1–8

  34. [34]

    Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems

    H. Wu, Y. Tseng, and H.-y. Lee, "Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems," in Interspeech 2024, 2024, pp. 1770–1774

  35. [35]

    Codecfake+: A large-scale neural audio codec-based deepfake speech dataset

    X. Chen, J. Du, H. Wu, L. Zhang, I. Lin, I. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang et al., "Codecfake+: A large-scale neural audio codec-based deepfake speech dataset," arXiv preprint arXiv:2501.08238, 2025

  36. [36]

    Codecfake-omni: A large-scale codec-based deepfake speech dataset

    J. Du, X. Chen, H. Wu, L. Zhang, I. Lin, I. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang et al., "Codecfake-omni: A large-scale codec-based deepfake speech dataset," arXiv e-prints, pp. arXiv–2501, 2025

  37. [37]

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

    J. Kong, J. Kim, and J. Bae, "Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020

  38. [38]

    Bigvgan: A universal neural vocoder with large-scale training

    S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, "Bigvgan: A universal neural vocoder with large-scale training," in The Eleventh International Conference on Learning Representations, 2023

  39. [39]

    Conformer: Convolution-augmented transformer for speech recognition

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., "Conformer: Convolution-augmented transformer for speech recognition," in Interspeech 2020, 2020, pp. 5036–5040

  40. [40]

    Libritts: A corpus derived from librispeech for text-to-speech

    H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "Libritts: A corpus derived from librispeech for text-to-speech," in Interspeech 2019, 2019, pp. 1526–1530

  41. [41]

    Librispeech: An ASR corpus based on public domain audio books

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

  42. [42]

    TwistList: Resources and baselines for tongue twister generation

    T. Loakman, C. Tang, and C. Lin, "TwistList: Resources and baselines for tongue twister generation," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, pp. 579–589

  43. [43]

    Robust speech recognition via large-scale weak supervision

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning, ICML 2023, 2023, pp. 28492–28518

  44. [44]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing

    S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., "Wavlm: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  45. [45]

    Nisqa: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

    G. Mittag, B. Naderi, A. Chehadi, and S. Möller, "Nisqa: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets," in Interspeech 2021, 2021, pp. 2127–2131

  46. [46]

    Mosnet: Deep learning-based objective assessment for voice conversion

    C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, "Mosnet: Deep learning-based objective assessment for voice conversion," Interspeech 2019, 2019