pith. sign in

arxiv: 2606.18323 · v1 · pith:7XU5COI3new · submitted 2026-06-16 · 💻 cs.SD · cs.LG

Reliable Neural-Codec Text-to-Speech by ASR Self-Verification and Distillation: Near-Zero Catastrophic Failures Across Models and Codecs

Pith reviewed 2026-06-26 22:44 UTC · model grok-4.3

classification 💻 cs.SD cs.LG
keywords neural codec TTSASR self-verificationcatastrophic failuresmodel distillationtext-to-speech reliabilityautoregressive generationbest-of-N sampling
0
0 comments X

The pith

Best-of-N ASR self-verification reduces catastrophic failures in neural-codec TTS to near zero across four systems and three codecs, with distillation recovering most of the gain in single-shot decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive neural-codec text-to-speech models produce high-quality speech on ordinary inputs but occasionally output silence, early termination, repetition, or hallucinated content. The paper demonstrates that a simple best-of-N selection process guided by an ASR round-trip check drives the rate of these failures to zero on standard test data by N=2 and on difficult prompts by N=4. The same pattern appears consistently across multiple open TTS models and codecs. Distilling the verified outputs back into the original model then restores a large fraction of the reliability without any added test-time computation.

Core claim

Under a single format-robust metric defined by ASR round-trip error rate, best-of-N ASR self-verification eliminates observed catastrophic failures by N=2 on LibriSpeech and by N=4 on a hard prompt set; the reduction holds across four open codec-TTS systems and three neural codecs. Distilling the self-verified outputs into the base model recovers 52-58 percent of the failure mass on hard inputs at single-shot inference cost, while offline DPO/IPO does not surpass plain supervised distillation and one larger model shows no benefit.

What carries the argument

ASR round-trip catastrophic-failure metric used for best-of-N selection, followed by distillation of the selected outputs into the base model.

If this is right

  • The failure reduction replicates across four different open codec-TTS systems and three neural codecs.
  • Distillation transfers most of the reliability gain to single-shot decoding, with the improvement concentrated on hard inputs.
  • Offline direct preference optimization does not outperform plain supervised distillation on this task.
  • A larger model in the tested family shows no improvement from the procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verification-plus-distillation pattern could be tested on other autoregressive audio or music generation tasks that exhibit stochastic collapse.
  • If the hard prompt set captures real deployment edge cases, the method would directly improve production reliability without changing model architecture.
  • An online iterative version of the distillation process might close the remaining failure gap if evaluated at larger scale.

Load-bearing premise

The ASR round-trip check identifies exactly the same outputs that a human listener would judge as catastrophic failures.

What would settle it

Human raters listening to the same outputs and counting failures at a rate comparable to the original model rather than near zero would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.18323 by Ali Asaria, Deep Gandhi, Tony Salomone.

Figure 1
Figure 1. Figure 1: Single-shot (base) vs. best-of-2 catastrophic-failure rate across four codec-TTS models [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Best-of-N failure rate by input difficulty (primary model). Hard inputs saturate at N=4, easy prose at N=2; the saturation point is the difficulty proxy associated with distillation headroom. x-axis N, y-axis catastrophic-failure rate. Capability ceiling. Rare words (e.g. otorhinolaryngologist) are a genuine limit rather than a sampling problem: only 2 of 10 such prompts ever yield a faithful sample within… view at source ↗
read the original abstract

Open autoregressive neural-codec text-to-speech (TTS) models sound excellent on typical inputs yet suffer stochastic catastrophic failures: on a meaningful fraction of utterances they emit silence, terminate early, or collapse into repetitive or hallucinated content. We show this failure mode is cheap to remove. Under a single format-robust metric (a catastrophic-failure rate via an ASR round-trip), best-of-N ASR self-verification drives failures to near-zero: no observed failures remain by N=2 on a standard corpus (LibriSpeech) and by N=4 on a hard prompt set. This is not an artifact of one model: the reduction replicates across four open codec-TTS systems and three neural codecs (XCodec2, SNAC, Mimi), reaching the near-zero floor by N=2 on three of the four. We then make the fix free at inference time by distilling the self-verified behaviour into the model, which recovers much of the robustness in single-shot decoding, closing ~52-58% of the failure mass on hard inputs at no test-time cost. The distillation gain concentrates where it is needed (hard inputs); on already-reliable prose there is no headroom and no detectable change. A controlled comparison adds a clean negative: offline direct preference optimization (DPO/IPO) does not beat plain supervised distillation, and an online iterative variant is promising but not statistically separable at our evaluation size. We report honestly the one model that resists (a larger Llasa where scale did not obviously help) and a rare-word capability ceiling that no self-distillation method overcomes

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that autoregressive neural-codec TTS models exhibit stochastic catastrophic failures (silence, early termination, repetition, hallucination) that can be driven to near-zero rates via best-of-N ASR self-verification under an ASR round-trip metric; this holds by N=2 on LibriSpeech and N=4 on a hard prompt set, replicates across four open codec-TTS systems and three neural codecs, and can be largely recovered at inference time by distilling the verified behavior into the base model (closing 52-58% of failure mass on hard inputs). A controlled comparison shows supervised distillation outperforms or matches DPO/IPO variants, with honest reporting of one resistant model and a rare-word ceiling.

Significance. If the ASR round-trip metric is validated against human perception, the work supplies a practical, model-agnostic, and low-cost route to reliable open neural-codec TTS, with the cross-system replication and the clean negative result on preference optimization constituting clear strengths. The distillation result is particularly useful because gains concentrate on hard inputs where they are needed.

major comments (2)
  1. [Abstract and evaluation section] Abstract and evaluation section: the central claim of 'near-zero catastrophic failures' and 'reliable' TTS rests on the ASR round-trip metric, yet no human listening study, inter-rater agreement statistic, or correlation coefficient between ASR-flagged failures and listener ratings of output quality is reported. This is load-bearing because ASR transcription errors or partial matches could misclassify subtle hallucinations or acceptable repetitions, undermining the perceptual reliability conclusion.
  2. [evaluation section] Hard prompt set construction (evaluation section): the claim that N=4 suffices on the hard set is central to the generality argument, but the manuscript provides no explicit criteria, sampling procedure, or statistics showing the set is representative of real deployment difficulties rather than an ad-hoc collection; without this, the replication across models cannot be fully assessed.
minor comments (2)
  1. [distillation section] The distillation procedure is described at a high level; adding the precise objective function or pseudocode would improve reproducibility without altering the central claim.
  2. [results tables] Table reporting failure rates should include per-model confidence intervals or exact counts to allow readers to judge whether the 'no observed failures' statement at N=2 is statistically robust given the corpus size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on metric validation and prompt-set construction. We address each point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract and evaluation section] Abstract and evaluation section: the central claim of 'near-zero catastrophic failures' and 'reliable' TTS rests on the ASR round-trip metric, yet no human listening study, inter-rater agreement statistic, or correlation coefficient between ASR-flagged failures and listener ratings of output quality is reported. This is load-bearing because ASR transcription errors or partial matches could misclassify subtle hallucinations or acceptable repetitions, undermining the perceptual reliability conclusion.

    Authors: We agree a human correlation study would strengthen perceptual claims. The ASR round-trip metric is deliberately conservative (exact-match or high-WER threshold on the full transcript), targeting only the catastrophic modes (silence, early stop, repetition, hallucination) that produce unambiguous mismatches; these are not subtle and are unlikely to be misclassified by modern ASR. We will add a small-scale human listening study with correlation statistics in the revised evaluation section. revision: yes

  2. Referee: [evaluation section] Hard prompt set construction (evaluation section): the claim that N=4 suffices on the hard set is central to the generality argument, but the manuscript provides no explicit criteria, sampling procedure, or statistics showing the set is representative of real deployment difficulties rather than an ad-hoc collection; without this, the replication across models cannot be fully assessed.

    Authors: The hard set was built by selecting prompts that triggered elevated failure rates in preliminary runs on multiple models, emphasizing rare words, long/complex syntax, and out-of-domain content. We will revise the evaluation section to document the explicit selection criteria, sampling procedure, and supporting statistics (baseline failure rates vs. LibriSpeech) to clarify representativeness. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results use external ASR metric without self-referential reduction

full rationale

The paper reports experimental measurements of failure rates (silence, repetition, hallucination) via an independent ASR round-trip verifier applied to outputs from multiple codec-TTS systems. Best-of-N sampling and subsequent distillation are evaluated directly against this external metric on LibriSpeech and hard prompts. No derivation, equation, or central claim reduces by construction to a fitted parameter, self-citation chain, or renamed input; the metric is applied uniformly as an evaluation tool rather than being defined in terms of the claimed robustness gain. This is a standard empirical setup with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method depends on the domain assumption that ASR round-trip evaluation reliably detects TTS failures.

free parameters (1)
  • N (number of candidates in best-of-N)
    N is chosen empirically to reach near-zero failures (N=2 on LibriSpeech, N=4 on hard prompts).
axioms (1)
  • domain assumption ASR round-trip provides a reliable and format-robust proxy for detecting catastrophic failures in TTS outputs.
    This is the central metric for both verification and evaluation.

pith-pipeline@v0.9.1-grok · 5840 in / 1398 out tokens · 54252 ms · 2026-06-26T22:44:11.239949+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 3 linked inside Pith

  1. [2]

    URLhttps://arxiv.org/abs/2507.21138

  2. [3]

    The T05 System for the VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech

    Kaito Baba, Wataru Nakata, Yuki Saito, and Hiroshi Saruwatari. The T05 System for the VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech. arXiv:2409.09305 [cs.SD], 2024. URL https://arxiv.org/abs/2409.09305

  3. [4]

    Preference-Based Learning in Audio Applications: A Systematic Analysis

    Aaron Broukhim, Yiran Shen, Prithviraj Ammanabrolu, and Nadir Weibel. Preference-Based Learning in Audio Applications: A Systematic Analysis. arXiv:2511.13936 [cs.SD], 2025. URL https://arxiv.org/abs/2511.13936

  4. [5]

    Orpheus: Towards Human-Sounding Speech (Orpheus-3B TTS)

    Canopy Labs. Orpheus: Towards Human-Sounding Speech (Orpheus-3B TTS). Model re- lease, HuggingFacecanopylabs/orpheus-3b-0.1-ft; code https://github.com/canopyai/ Orpheus-TTS, 2025. URLhttps://huggingface.co/canopylabs/orpheus-3b-0.1-ft

  5. [6]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. arXiv:2412.10117 [cs.SD], 2024. URLhttps://arxiv.org...

  6. [7]

    The Differences Between Direct Alignment Algorithms are a Blur

    Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, and Daniil Gavrilov. The Differences Between Direct Alignment Algorithms are a Blur. arXiv:2502.01237 [cs.LG], 2025. URLhttps://arxiv.org/abs/2502.01237

  7. [8]

    Desta, Roy Fejgin, Rafael Valle, and Jason Li

    Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, and Jason Li. Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance. arXiv:2502.05236 [cs.SD], 2025. URLhttps://arxiv.org/abs/2502.05236

  8. [9]

    Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization

    Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Roy Fejgin, Ryan Langman, Mikyas Desta, Leili Tavabi, and Jason Li. Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization. arXiv:2509.21718 [cs.AI], 2025. URLhttps://arxiv.org/abs/2509.21718

  9. [10]

    MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

    Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, and Zhou Zhao. MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis. arXiv:2502.18924 [eess.AS], 2025. URLhttps://arxiv.org/abs/2502.18924

  10. [11]

    A Survey of Direct Preference Optimization

    Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, and Dacheng Tao. A Survey of Direct Preference Optimization. arXiv:2503.11701 [cs.LG], 2025. URL https://arxiv.org/abs/ 2503.11701

  11. [12]

    TTSDS – Text-to-Speech Distribution Score

    Christoph Minixhofer, Ondřej Klejch, and Peter Bell. TTSDS – Text-to-Speech Distribution Score. arXiv:2407.12707 [eess.AS], 2024. URLhttps://arxiv.org/abs/2407.12707

  12. [13]

    CSM: A Conversational Speech Generation Model

    Sesame AI. CSM: A Conversational Speech Generation Model. Model release, Hugging- Face sesame/csm-1b; code https://github.com/SesameAILabs/csm, 2025. URL https: //huggingface.co/sesame/csm-1b

  13. [14]

    Metis: A Foundation Speech Generation Model with Masked Generative Pre-training

    Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, and Zhizheng Wu. Metis: A Foundation Speech Generation Model with Masked Generative Pre-training. arXiv:2502.03128 [cs.SD], 2025. URLhttps://arxiv.org/abs/2502.03128

  14. [15]

    A Comprehensive Survey of Di- rect Preference Optimization: Datasets, Theories, Variants, and Applications

    Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, and Fei Wu. A Comprehensive Survey of Di- rect Preference Optimization: Datasets, Theories, Variants, and Applications. arXiv:2410.15595 [cs.AI], 2024. URLhttps://arxiv.org/abs/2410.15595

  15. [16]

    Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

    Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, and Wei Xue. Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis. arXiv:2502.04128 [eess.AS], 2025. UR...

  16. [18]

    URLhttps://arxiv.org/abs/2508.17229. 9