RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

Frederik Bous; Hyeongju Kim; Jinhyeok Yang; Joon Byun; Juheon Lee; Yechan Yu

arXiv: 2605.22083 · v1 · pith:ZJYIJ3HRnew · submitted 2026-05-21 · cs.SD · cs.LG · eess.AS

Jinhyeok Yang, Hyeongju Kim, Yechan Yu, Joon Byun, Frederik Bous, Juheon Lee

Pith reviewed 2026-05-22 02:51 UTC · model grok-4.3

classification cs.SD · cs.LG · eess.AS

keywords text-to-speech · flow matching · contrastive learning · data augmentation · alignment robustness · zero-shot synthesis · speech intelligibility

The pith

Augmenting flow-matching TTS with repeat and skip latent examples reduces alignment errors in zero-shot speech synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix content fidelity problems in flow-matching text-to-speech systems, where imperfect alignment causes skip and repeat errors. It introduces RobustSpeechFlow, a training strategy that extends contrastive flow matching using length-preserving repeat and skip latent augmentations. This approach penalizes common failure modes directly during training without needing external aligners or preference data. The method integrates easily into existing pipelines and shows measurable improvements in intelligibility on standard benchmarks. A sympathetic reader would care because it offers a simple way to make high-quality zero-shot TTS more reliable for real-world use.

Core claim

By extending contrastive flow matching with length-preserving repeat and skip latent augmentations, RobustSpeechFlow directly penalizes realistic alignment failure modes in TTS, improving content fidelity while requiring no external aligners or preference data and integrating readily into existing pipelines.

What carries the argument

Length-preserving repeat and skip latent augmentations applied within a contrastive flow matching framework to penalize skip and repeat errors.
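
To make the mechanism concrete, here is a minimal NumPy sketch of what length-preserving repeat and skip operations on a latent of shape (C, T) could look like. The function names, the way the segment is chosen, and the way the original length is restored (truncating the tail after a repeat, re-using the last frame after a skip) are illustrative assumptions; the paper's actual augmentation procedure is not specified in this summary.

```python
import numpy as np

def repeat_augment(x, start, seg_len):
    """Hypothetical repeat: play x[:, start:start+seg_len] twice, then truncate to T."""
    T = x.shape[1]
    seg = x[:, start:start + seg_len]
    out = np.concatenate([x[:, :start + seg_len], seg, x[:, start + seg_len:]], axis=1)
    return out[:, :T]  # drop the overflow so the length is preserved

def skip_augment(x, start, seg_len):
    """Hypothetical skip: delete a segment, then pad by repeating the last frame."""
    T = x.shape[1]
    out = np.concatenate([x[:, :start], x[:, start + seg_len:]], axis=1)
    pad = np.repeat(out[:, -1:], T - out.shape[1], axis=1)  # padding choice is an assumption
    return np.concatenate([out, pad], axis=1)

x = np.arange(8, dtype=np.float32)[None, :]  # toy latent with C=1, T=8
x_rep = repeat_augment(x, start=2, seg_len=2)
x_skip = skip_augment(x, start=2, seg_len=2)
print(x_rep[0])
print(x_skip[0])
```

On this toy input both outputs keep T = 8 frames: the repeat yields frames 0 1 2 3 2 3 4 5 and the skip yields frames 0 1 4 5 6 7 7 7.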

If this is right

On Seed-TTS-eval, word error rate drops from 1.44 to 1.38 with a 0.06B-parameter model.
On ZERO500 at NFE=24, English CER falls from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%.
Intelligibility gains are consistent across diverse speaker and prosody conditions.
Robustness improves without changing the base flow-matching objective.
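
For scale, the absolute changes above can be converted to relative reductions. The short Python snippet below does only that arithmetic on the figures quoted in the abstract; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before, after):
    """Fractional drop relative to the baseline value."""
    return (before - after) / before

results = {
    "Seed-TTS-eval WER": (1.44, 1.38),
    "ZERO500 English CER (%)": (0.48, 0.35),
    "ZERO500 Korean CER (%)": (0.81, 0.57),
}
reductions = {name: relative_reduction(b, a) for name, (b, a) in results.items()}
for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% relative reduction")
```

The relative reductions come out at roughly 4% for the Seed-TTS-eval WER and roughly 27% and 30% for the English and Korean CER on ZERO500.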

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could extend to other generative models facing alignment issues, such as in video or music generation.
Testing on more languages or noisy real-world inputs might reveal further benefits or limitations.
Combining with other robustness techniques could compound the error reductions.
The augmentations might generalize to different flow-based TTS architectures beyond the one tested.

Load-bearing premise

That length-preserving repeat and skip latent augmentations sufficiently cover the realistic failure modes of flow-matching TTS alignment without introducing new artifacts or requiring changes to the base flow-matching objective.

What would settle it

Observing no reduction in skip or repeat errors on a held-out test set with varied prosody or speakers, or detecting new artifacts in generated audio that increase overall error rates.

Figures

Figures reproduced from arXiv: 2605.22083 by Frederik Bous, Hyeongju Kim, Jinhyeok Yang, Joon Byun, Juheon Lee, Yechan Yu.

**Figure 1.** Provides a clearer perspective on how alignment stability evolves over the course of training, contextualizing the final results from [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]

Original abstract

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds length-preserving repeat and skip latent augmentations to contrastive flow matching and reports small intelligibility gains on TTS benchmarks.

The authors add length-preserving repeat and skip augmentations to the latent trajectories used in contrastive flow matching training for text-to-speech. The aim is to make the model more robust to the alignment problems that cause skips and repeats in generated speech. What is new is the specific use of these augmentations inside the contrastive framework for flow-matching TTS, which does not appear in the prior work they reference.

The paper presents a method that needs no additional aligners or human preference data and can be added to existing pipelines with relative ease. The results show small improvements in word error rate on Seed-TTS-eval and in character error rate on their ZERO500 set for English and Korean.

On the downside, the absolute improvements are small, and the description says too little about training details, statistical significance, and ablations that isolate the contribution of the augmentations from other factors. The assumption that these latent operations directly penalize realistic failure modes without introducing other problems is not backed by any propagation analysis or error diagnostics in the summary. It is therefore hard to rule out that the gains come from general regularization rather than targeted robustness.

The paper is aimed at specialists in zero-shot TTS and flow-matching generative models, and anyone working on content accuracy in speech synthesis would find the approach worth considering. Given the concrete benchmarks and the focused contribution, it is worth sending for peer review so that experts can examine the full experiments and any additional analysis in the paper.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RobustSpeechFlow, a training strategy for flow-matching text-to-speech that extends contrastive flow matching with length-preserving repeat and skip latent augmentations to improve robustness against alignment errors such as skips and repeats. The method requires no external aligners or preference data and is presented as readily integrable into existing pipelines. It reports modest empirical gains: WER reduction from 1.44 to 1.38 on Seed-TTS-eval using a 0.06B parameter model, and CER reductions on the ZERO500 benchmark (English: 0.48% to 0.35%; Korean: 0.81% to 0.57%) at NFE=24.

Significance. If the reported gains prove robust and attributable to the targeted augmentations rather than generic regularization, the approach offers a lightweight, data-free way to mitigate content fidelity issues in flow-matching TTS. The integration of contrastive learning directly on latent trajectories is a reasonable direction for alignment robustness. However, the small effect sizes and absence of mechanistic validation limit the potential significance; the work would benefit from stronger evidence that the method addresses realistic failure modes without introducing new artifacts.

major comments (2)

[Abstract] Abstract: The concrete error-rate reductions (WER 1.44→1.38; CER 0.48%→0.35% and 0.81%→0.57%) are presented without any details on training setup, number of runs, statistical significance testing, baseline comparisons, or ablation of the augmentation strategy. This absence prevents verification that the central claim of improved alignment robustness holds and is not due to uncontrolled factors.
[Method] Method section (augmentation description): The load-bearing assumption that length-preserving repeat and skip operations on latent trajectories specifically penalize skip/repeat content errors in the decoded output lacks any diagnostic analysis. No evidence is provided on how these perturbations propagate through the learned vector field to produce corresponding phonetic-level errors in the mel-spectrogram or waveform, nor whether the contrastive term alters the base flow-matching objective beyond generic regularization. This directly undermines attribution of the small observed gains to the proposed mechanism.

minor comments (2)

[Experiments] The manuscript should clarify the construction and diversity of the ZERO500 benchmark, including speaker/prosody conditions and how it differs from existing evaluation sets.
[Abstract] Audio samples are linked, which is positive; however, the paper would benefit from quantitative analysis of other metrics (e.g., speaker similarity, naturalness) alongside the reported intelligibility measures to ensure no trade-offs.
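
The first major comment asks for statistical significance testing. One common way to provide it for WER or CER comparisons is a paired bootstrap over test utterances. The sketch below is a generic illustration of that technique, not something the paper reports: the function name, the per-utterance error counts, and the utterance lengths are all synthetic placeholders.

```python
import numpy as np

def paired_bootstrap_wer_diff(err_a, err_b, n_words, n_boot=2000, seed=0):
    """Bootstrap the corpus-level WER difference (B minus A) over utterances."""
    err_a, err_b, n_words = map(np.asarray, (err_a, err_b, n_words))
    rng = np.random.default_rng(seed)
    n = len(n_words)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample utterances with replacement
    words = n_words[idx].sum(axis=1)
    diff = (err_b[idx].sum(axis=1) - err_a[idx].sum(axis=1)) / words
    low, high = np.percentile(diff, [2.5, 97.5])  # 95% percentile interval
    return diff.mean(), low, high

# Synthetic toy counts (NOT from the paper): 50 utterances of 10 to 14 words.
n_words = 10 + np.arange(50) % 5
err_a = np.arange(50) % 3 + 1   # hypothetical baseline errors per utterance
err_b = err_a - 1               # hypothetical errors after the change
mean, low, high = paired_bootstrap_wer_diff(err_a, err_b, n_words)
print(mean, low, high)
```

An interval that lies entirely below zero would indicate that system B has a lower error rate than system A on the resampled test sets; the same utterance indices are used for both systems, which is what makes the test paired.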

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we plan to incorporate.

Referee: [Abstract] Abstract: The concrete error-rate reductions (WER 1.44→1.38; CER 0.48%→0.35% and 0.81%→0.57%) are presented without any details on training setup, number of runs, statistical significance testing, baseline comparisons, or ablation of the augmentation strategy. This absence prevents verification that the central claim of improved alignment robustness holds and is not due to uncontrolled factors.

Authors: We agree that the abstract's brevity omits key experimental context. In the revised version we will expand the abstract to note the 0.06B model size, the training dataset and configuration, that results are reported as averages over multiple runs, and explicit pointers to the main-text sections containing baseline comparisons, augmentation ablations, and any statistical analysis. These additions will make the reported gains more verifiable without altering the abstract's length constraints.

revision: yes

Referee: [Method] Method section (augmentation description): The load-bearing assumption that length-preserving repeat and skip operations on latent trajectories specifically penalize skip/repeat content errors in the decoded output lacks any diagnostic analysis. No evidence is provided on how these perturbations propagate through the learned vector field to produce corresponding phonetic-level errors in the mel-spectrogram or waveform, nor whether the contrastive term alters the base flow-matching objective beyond generic regularization. This directly undermines attribution of the small observed gains to the proposed mechanism.

Authors: We acknowledge the absence of direct mechanistic diagnostics in the current manuscript. The paper does contain ablation results isolating the contribution of the repeat/skip augmentations over plain contrastive flow matching, supporting that the gains are not purely generic regularization. To strengthen attribution we will add a short diagnostic subsection (or appendix) with trajectory visualizations and a controlled comparison of vector-field behavior under the proposed augmentations versus random perturbations. We maintain that the length-preserving design specifically mimics realistic alignment failures, but we accept that additional propagation analysis will improve the manuscript.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical augmentation strategy remains independent of its reported gains

The paper introduces RobustSpeechFlow as a training addition that extends contrastive flow matching via length-preserving repeat and skip latent augmentations to penalize alignment errors. Reported WER/CER reductions (1.44→1.38 on Seed-TTS-eval; 0.48%→0.35% English and 0.81%→0.57% Korean on ZERO500) are presented as experimental outcomes on fixed benchmarks rather than quantities defined by the augmentations themselves or by self-citation chains. No equations or sections reduce the central claim to a fitted parameter renamed as prediction, a self-definitional loop, or an ansatz smuggled via prior work by the same authors. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or load-bearing self-citations to justify its improvements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated. Standard flow-matching assumptions (continuous normalizing flows, optimal transport paths) are implicitly used but not detailed.

pith-pipeline@v0.9.0 · 5732 in / 1146 out tokens · 27856 ms · 2026-05-22T02:51:45.844368+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations.

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: L = L_pos − λ_rand L_rand − λ_aug L_aug
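
The loss quoted above can be read as a positive flow-matching term minus two weighted contrastive terms. The NumPy sketch below evaluates it for a single example under stated assumptions: the linear path x_t = (1−t)ϵ + t x from the paper's preliminaries (whose velocity target is x − ϵ), a negative target taken from another sample for L_rand in the spirit of contrastive flow matching, and a negative target built from a repeat-augmented copy of x for L_aug. How the paper actually forms each negative target, the λ values, and all function and variable names here are assumptions, not details confirmed by this summary.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def robust_cfm_loss(v_pred, x, eps, x_rand, eps_rand, x_aug, lam_rand=0.05, lam_aug=0.05):
    """L = L_pos - lam_rand * L_rand - lam_aug * L_aug (weights are placeholders)."""
    l_pos = mse(v_pred, x - eps)             # pull toward the true velocity
    l_rand = mse(v_pred, x_rand - eps_rand)  # push away from another sample's velocity (assumed)
    l_aug = mse(v_pred, x_aug - eps)         # push away from the augmented velocity (assumed)
    return l_pos - lam_rand * l_rand - lam_aug * l_aug

rng = np.random.default_rng(0)
C, T = 4, 16
x, x_rand = rng.normal(size=(C, T)), rng.normal(size=(C, T))
eps, eps_rand = rng.normal(size=(C, T)), rng.normal(size=(C, T))
x_aug = np.concatenate([x[:, :8], x[:, 4:12]], axis=1)  # toy repeat: frames 4..7 play twice
t = 0.3
x_t = (1 - t) * eps + t * x  # point on the linear path (what a model would receive as input)
v_pred = x - eps             # stand-in for a perfect model output at (x_t, t, c)
loss = robust_cfm_loss(v_pred, x, eps, x_rand, eps_rand, x_aug)
print(loss)
```

With a perfect velocity prediction the positive term vanishes and the total is negative, since both contrastive terms are subtracted. In this form the objective stays a convex quadratic in the prediction only while lam_rand + lam_aug < 1, which is one reason to keep the placeholder weights small.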

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 4 internal anchors

[1] Introduction. In text-to-speech (TTS), the primary requirement is content fidelity: the system must render the intended text accurately. In production, failures such as word or phrase repetition and skipping are not minor artifacts; they materially reduce reliability and can create safety and compliance risks. As modern generative TTS systems continue to ...

[2] Background. 2.1. Alignment Robustness. Alignment errors have been a persistent challenge since early deep-learning-based TTS. Tacotron 2 improves attention stability with location-sensitive attention [14]. Deep Convolutional TTS (DCTTS) introduces a guided-attention loss and a forced-incremental attention heuristic at synthesis time to mitigate skippe...

[3] RobustSpeechFlow. 3.1. Preliminaries: Conditional Flow Matching. Let $x \in \mathbb{R}^{C \times T}$ be a continuous latent speech sequence produced by the Supertonic speech autoencoder [3], and let $c$ denote conditioning inputs (text and optional speaker prompt). We use a standard linear probability path for conditional flow matching: $\epsilon \sim \mathcal{N}(0, I)$, $t \sim U(0, 1)$, $x_t = (1 - t)\epsilon + t x$ (1). The t...

[4] Experimental Setup. 4.1. Training Data. We train on internal corpora of approximately 10k hours, 5M utterances, and 80k speakers per language (English and Korean), utilizing a mix of human-annotated and ASR-generated transcriptions. 4.2. Model and Baselines. We apply RobustSpeechFlow to SupertonicTTS [3], a compact flow-matching model operating on Superto...

[5] Results. 5.1. Zero-shot Intelligibility on Seed-TTS-eval. We first evaluate on the public Seed-TTS-eval benchmark [21]. Table 1 summarizes the benchmark comparison against representative zero-shot TTS systems together with our in-family baselines. Within the compact SupertonicTTS setup, standard ContrastiveFM already reduces the WER from 1.44 to 1.41, and...

[6] Discussion and Conclusion. We presented RobustSpeechFlow, a zero-shot TTS training strategy that improves content fidelity through length-preserving repeat and skip augmentations in contrastive flow matching. By simulating these common failure modes within the latent space, our approach directly aligns the contrastive penalty with the structured errors ...

[7] Generative AI Use Disclosure. Generative AI tools were used exclusively for proofreading, improving the grammatical quality of the manuscript, and assisting in the preparation of the demo webpage. The core text and all scientific contributions were fully developed and authored by the human authors.

[8] K. Lee, D. W. Kim, J. Kim, S. Chung, and J. Cho, “Ditto-tts: Diffusion transformers for scalable text-to-speech without domain-specific factors,” in International Conference on Learning Representations (ICLR), 2025.

[9] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271.

[10] H. Kim, J. Yang, Y. Yu, S. Ji, J. Morton, F. Bous, J. Byun, and J. Lee, “Supertonictts: Towards highly efficient and streamlined text-to-speech system,” arXiv preprint arXiv:2503.23108, 2025.

[11] Y. Sun, H. Bao, W. Wang, Z. Peng, L. Dong, S. Huang, J. Wang, and F. Wei, “Multimodal latent language modeling with next-token diffusion,” arXiv preprint arXiv:2412.08635, 2024.

[12] D. Jia, Z. Chen, J. Chen, C. Du, J. Wu, J. Cong, X. Zhuang, C. Li, Z. Wei, Y. Wang et al., “Ditar: Diffusion transformer autoregressive modeling for speech generation,” arXiv preprint arXiv:2502.03930, 2025.

[13] Y. Zhou, G. Zeng, X. Liu, X. Li, R. Yu, Z. Wang, R. Ye, W. Sun, J. Gui, K. Li, Z. Wu, and Z. Liu, “Voxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning,” arXiv preprint arXiv:2509.24650, 2025.

[14] Z. Peng, J. Yu, W. Wang, Y. Chang, Y. Sun, L. Dong, Y. Zhu, W. Xu, H. Bao, Z. Wang, S. Huang, Y. Xia, and F. Wei, “Vibevoice: Expressive podcast generation with next-token diffusion,” in The Fourteenth International Conference on Learning Representations, 2026.

[15] G. Chen, S. Chai, G.-B. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,” in Proc. Interspeech 2021, 2021, pp. 3670–3674.

[16] H. He, Z. Shang, C. Wang, X. Li, Y. Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shi et al., “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,” 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 885–890, 2024.

[17] J. Yang, J. Lee, H.-S. Choi, S. Ji, H. Kim, and J. Lee, “Dualspeech: Enhancing speaker-fidelity and text-intelligibility through dual classifier-free guidance,” in Proc. Interspeech, 2024.

[18] X. Zhang, Y. Wang, C. Wang, Z. Li, Z. Chen, and Z. Wu, “Advancing zero-shot text-to-speech intelligibility across diverse domains via preference alignment,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 12251–12270.

[19] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Advances in Neural Information Processing Systems, 2023.

[20] Y. A. Li, R. Kumar, and Z. Jin, “Dmdspeech: Distilled diffusion model surpassing the teacher in zero-shot speech synthesis via direct metric optimization,” in International Conference on Learning Representations (ICLR), 2025.

[21] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in Proc. IEEE ICASSP, 2018.

[22] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” in Proc. IEEE ICASSP, 2018, pp. 4784–4788.

[23] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.

[24] H. Kim, J. Lee, J. Yang, and J. Morton, “Length-aware rotary position embedding for text-speech alignment,” arXiv preprint arXiv:2509.11084, 2025.

[25] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning, 2020, pp. 1597–1607.

[26] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.

[27] G. Stoica, V. Ramanujan, X. Fan, A. Farhadi, R. Krishna, and J. Hoffman, “Contrastive flow matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[28] P. Anastassiou, J. Chen, J. Chen et al., “Seed-tts: A family of high-quality versatile speech generation models,” arXiv preprint arXiv:2406.02430, 2024.

[29] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, 2023.

[30] Z. Jiang, Y. Ren, R. Li, S. Ji, B. Zhang, Z. Ye, C. Zhang, J. Bai, X. Yang, J. Zuo, Y. Zhang, R. Liu, X. Yin, and Z. Zhao, “Megatts 3: Sparse alignment enhanced latent diffusion transformer for zero-shot speech synthesis,” arXiv preprint arXiv:2502.18924, 2025.

[31] B. Zhang, C. Guo, G. Yang, H. Yu, H. Zhang, H. Lei, J. Mai, J. Yan, K. Yang, M. Yang, P. Huang, R. Jin, S. Jiang, W. Cheng, Y. Li, Y. Xiao, Y. Zhou, Y. Zhang, Y. Lu, and Y. He, “Minimax-speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder,” arXiv preprint arXiv:2505.07916, 2025.

[32] Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, X. Shi, K. An et al., “Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training,” arXiv preprint arXiv:2505.17589, 2025.

[33] X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025.

[34] OpenAudio, “Introducing s1,” OpenAudio technical blog, June 3, 2025.

[35] S. Zhou, Y. Zhou, Y. He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025.


work page 2024

[18] [18]

Ad- vancing zero-shot text-to-speech intelligibility across diverse do- mains via preference alignment,

X. Zhang, Y . Wang, C. Wang, Z. Li, Z. Chen, and Z. Wu, “Ad- vancing zero-shot text-to-speech intelligibility across diverse do- mains via preference alignment,” inProceedings of the 63rd An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 12 251–12 270

work page 2025

[19] [19]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” inAdvances in Neural Infor- mation Processing Systems, 2023

work page 2023

[20] [20]

Dmdspeech: Distilled diffusion model surpassing the teacher in zero-shot speech synthesis via di- rect metric optimization,

Y . A. Li, R. Kumar, and Z. Jin, “Dmdspeech: Distilled diffusion model surpassing the teacher in zero-shot speech synthesis via di- rect metric optimization,” inInternational Conference on Learn- ing Representations (ICLR), 2025

work page 2025

[21] [21]

Natural tts synthesis by condi- tioning wavenet on mel spectrogram predictions,

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y . Zhang, Y . Wang, R. J. Skerry-Ryan, R. A. Saurous, Y . Agiomyrgiannakis, and Y . Wu, “Natural tts synthesis by condi- tioning wavenet on mel spectrogram predictions,” inProc. IEEE ICASSP, 2018

work page 2018

[22] [22]

Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,

H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” inProc. IEEE ICASSP, 2018, pp. 4784–4788

work page 2018

[23] [23]

Con- nectionist temporal classification: Labelling unsegmented se- quence data with recurrent neural networks,

A. Graves, S. Fern ´andez, F. Gomez, and J. Schmidhuber, “Con- nectionist temporal classification: Labelling unsegmented se- quence data with recurrent neural networks,” inProceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376

work page 2006

[24] [24]

Length-aware rotary position embedding for text-speech alignment,

H. Kim, J. Lee, J. Yang, and J. Morton, “Length-aware rotary position embedding for text-speech alignment,”arXiv preprint arXiv:2509.11084, 2025

work page arXiv 2025

[25] [25]

A simple framework for contrastive learning of visual representations,

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning, 2020, pp. 1597–1607

work page 2020

[26] [26]

Momentum contrast for unsupervised visual representation learning,

K. He, H. Fan, Y . Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2020, pp. 9729–9738

work page 2020

[27] [27]

Contrastive flow matching,

G. Stoica, V . Ramanujan, X. Fan, A. Farhadi, R. Krishna, and J. Hoffman, “Contrastive flow matching,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025

[28] [28]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

P. Anastassiou, J. Chen, J. Chenet al., “Seed-tts: A family of high-quality versatile speech generation models,”arXiv preprint arXiv:2406.02430, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Robust speech recognition via large-scale weak su- pervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak su- pervision,” inProceedings of the 40th International Conference on Machine Learning, 2023

work page 2023

[30] [30]

Megatts 3: Sparse alignment enhanced latent diffusion transformer for zero-shot speech synthesis,

Z. Jiang, Y . Ren, R. Li, S. Ji, B. Zhang, Z. Ye, C. Zhang, J. Bai, X. Yang, J. Zuo, Y . Zhang, R. Liu, X. Yin, and Z. Zhao, “Megatts 3: Sparse alignment enhanced latent diffusion transformer for zero-shot speech synthesis,”arXiv preprint arXiv:2502.18924, 2025

work page arXiv 2025

[31] [31]

Minimax- speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder,

B. Zhang, C. Guo, G. Yang, H. Yu, H. Zhang, H. Lei, J. Mai, J. Yan, K. Yang, M. Yang, P. Huang, R. Jin, S. Jiang, W. Cheng, Y . Li, Y . Xiao, Y . Zhou, Y . Zhang, Y . Lu, and Y . He, “Minimax- speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder,”arXiv preprint arXiv:2505.07916, 2025

work page arXiv 2025

[32] [32]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Z. Du, C. Gao, Y . Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, X. Shi, K. Anet al., “Cosyvoice 3: Towards in-the- wild speech generation via scaling-up and post-training,”arXiv preprint arXiv:2505.17589, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Fenget al., “Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,”arXiv preprint arXiv:2503.01710, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Introducing s1,

OpenAudio, “Introducing s1,” OpenAudio technical blog, 2025, june 3, 2025

work page 2025

[35] [35]

Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,

S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025

work page arXiv 2025