
arxiv: 2606.09234 · v1 · pith:IJCIFOTYnew · submitted 2026-06-08 · 💻 cs.SD · cs.AI

End-to-End Training for Discrete Token LLM based TTS System

Pith reviewed 2026-06-27 15:20 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords end-to-end training · text-to-speech · discrete tokens · large language model · flow matching · speech tokenizer · word error rate · autoregressive model

The pith

Jointly optimizing the speech tokenizer, LLM, flow-matching model and reward model produces discrete tokens better suited to text-to-speech than separately trained components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a cascaded TTS pipeline of tokenizer, autoregressive LLM and flow-matching decoder can be trained as a single system instead of in separate stages. Multi-task objectives on the tokenizer combine reconstruction, next-token prediction and recognition signals so the resulting discrete tokens carry both acoustic and semantic detail useful for synthesis. The LLM is then further tuned with feedback from the decoder and reward model to reduce mismatch at generation time. On the Seed-TTS-Eval benchmark the resulting system reaches word error rates of 0.78 percent and 1.56 percent using a 0.6 B LLM and 0.5 B flow-matching model, beating prior cascaded baselines. The unified training removes the need for independent optimization of each stage.

Core claim

Jointly optimizing the speech tokenizer using reconstruction for the flow-matching model, next-token prediction for the LLM and multi-recognition for the reward model creates a discrete token space better tailored to TTS; further optimizing the LLM with downstream reconstruction and recognition objectives then reduces inference mismatch and steers generations toward preferred outputs, yielding consistent gains over independently trained cascaded pipelines.

What carries the argument

The end-to-end optimization framework that applies multi-task objectives to the tokenizer and downstream reconstruction and recognition objectives to the LLM.
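
As a rough illustration of what "multi-task objectives on the tokenizer" could look like, the sketch below folds three per-batch loss values into one scalar with a weighted sum. The function name `tokenizer_multitask_loss`, the weights, and the numeric loss values are all hypothetical; the abstract does not state how the paper weights or combines its objectives.

```python
def tokenizer_multitask_loss(recon_loss, ntp_loss, recog_loss,
                             weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three tokenizer objectives named in the abstract.

    recon_loss: reconstruction loss (flow-matching side)
    ntp_loss:   next-token prediction loss (LLM side)
    recog_loss: recognition loss (reward-model side)
    weights:    hypothetical weights, not taken from the paper
    """
    w_recon, w_ntp, w_recog = weights
    return w_recon * recon_loss + w_ntp * ntp_loss + w_recog * recog_loss


# Placeholder loss values, for illustration only.
total = tokenizer_multitask_loss(0.5, 2.0, 1.0, weights=(1.0, 0.5, 0.25))
print(total)
```

In a real training loop each term would be a differentiable tensor, so a single backward pass through the sum would send gradients from all three downstream models into the tokenizer.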

If this is right

  • The unified system consistently outperforms cascaded baselines on TTS tasks.
  • New state-of-the-art word error rates of 0.78 percent and 1.56 percent are reached on Seed-TTS-Eval.
  • The gains hold with a 0.6 B-parameter LLM and 0.5 B-parameter flow-matching model.
  • Training is simplified because the components no longer require separate optimization stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-objective approach could be tested on other discrete-token pipelines such as music or sound-effect generation.
  • Reducing decoder-LLM mismatch through downstream feedback may apply to any autoregressive generation system that uses a separate decoder.
  • The resulting token space might also improve performance on downstream speech recognition or understanding tasks.

Load-bearing premise

The joint multi-task objectives on the tokenizer produce a discrete token space measurably better suited to TTS than tokens trained independently.

What would settle it

An ablation that trains the tokenizer independently on the same data and measures whether word error rate on Seed-TTS-Eval rises back to the level of the cascaded baseline.

Figures

Figures reproduced from arXiv: 2606.09234 by Changfeng Gao, Jun Yuan, ShiDong Shang, Ye Bai, Yong Ren, Zhao You.

Figure 1: Illustrations of the E2E TTS framework. The LM, FM and RM can take both …
Figure 2: Three-stage training for the E2E TTS. Specifically, we replace the ground-truth tokens $q_{1:T}$ with either the Gumbel–Softmax hard samples or the hidden states of the LLM, denoted as $h^{\mathrm{LM}}_{1:T}$:

$$q'_t = \mathrm{CodeBook}\big(\mathrm{GumbelSoftmax}(p(c_t \mid c_{1:t-1}))\big) \quad \text{or} \quad h^{\mathrm{LM}}_t. \tag{10}$$

Accordingly, the reconstruction and flow-matching losses computed using LLM-predicted tokens are defined as $\mathcal{L}_{\mathrm{LRM}}$ and $\mathcal{L}_{\mathrm{LFM}}$, respectively. The combination…
Figure 3: Distributions for speech tokens trained with different loss functions on the LibriTTS …
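
The Figure 2 caption mentions replacing ground-truth tokens with Gumbel-Softmax hard samples that are passed through a codebook. The sketch below shows one such draw in pure Python: perturb the logits with Gumbel noise, apply a temperature softmax, take the one-hot argmax, and use it to select a codebook row. The helper names, the temperature, the logits, and the toy codebook are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def gumbel_softmax_hard(logits, temperature=1.0, rng=random):
    """Return (index, one_hot, soft_probs) for one Gumbel-Softmax draw."""
    # Gumbel(0, 1) noise: -log(-log(u)) with u clamped inside (0, 1).
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # "Hard" sample: one-hot vector at the argmax of the soft distribution.
    index = max(range(len(soft)), key=soft.__getitem__)
    one_hot = [1.0 if k == index else 0.0 for k in range(len(soft))]
    return index, one_hot, soft


def codebook_lookup(one_hot, codebook):
    """Select a codebook row as the product one_hot @ codebook."""
    dim = len(codebook[0])
    return [sum(w * row[d] for w, row in zip(one_hot, codebook))
            for d in range(dim)]


# Hypothetical 3-entry codebook with 2-dimensional embeddings.
codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
rng = random.Random(0)
index, one_hot, soft = gumbel_softmax_hard([2.0, 0.5, -1.0],
                                           temperature=0.7, rng=rng)
embedding = codebook_lookup(one_hot, codebook)
print(index, embedding)
```

Pure Python has no gradients, so the differentiable part is not shown. In an autodiff framework the hard one-hot is usually paired with a straight-through estimator so that gradients flow through the soft probabilities; PyTorch's `torch.nn.functional.gumbel_softmax` with `hard=True` is one existing implementation of that idea.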
Original abstract

Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for the FM model, next-token prediction for the LLM, and multi-recognition tasks for the RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by the FM model and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves word error rates (WERs) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.
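
To make the "flow-matching" component more concrete, the sketch below builds one training example under a common formulation: a straight-line path from a noise sample to a data sample, with the constant velocity along that path as the regression target. This is a generic illustration under that assumption and may differ from the paper's FM model; the helper names and numeric values are placeholders.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, and its velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [0.2, -0.4, 1.0]                    # placeholder "data" frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # time drawn from [0, 1)
x_t, target_v = flow_matching_pair(x0, x1, t)

# A real model would predict the velocity from (x_t, t, conditioning tokens).
# A zero prediction stands in for an untrained model here.
loss = mse([0.0] * len(x1), target_v)
print(loss)
```

In a token-conditioned TTS decoder the conditioning would include the discrete speech tokens, which is where tokenizer quality can influence the reconstruction loss.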

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a fully end-to-end optimization framework for discrete-token LLM-based TTS that jointly trains the speech tokenizer on multi-task objectives (reconstruction for the flow-matching model, next-token prediction for the LLM, and multi-recognition for an added reward model), then further optimizes the LLM using downstream reconstruction and recognition signals. This is claimed to produce tokens better tailored to TTS and reduce inference-time mismatch, yielding consistent outperformance of cascaded baselines and new SOTA WERs of 0.78% and 1.56% on Seed-TTS-Eval using a 0.6B LLM and 0.5B FM model.

Significance. If the central empirical claims hold after proper controls, the work would demonstrate that holistic joint training can improve discrete-token TTS performance while simplifying the pipeline relative to independent component training, providing evidence that multi-task tokenization and downstream LLM steering are effective for reducing mismatch in cascaded systems.

major comments (2)
  1. [Abstract / Experimental results] Abstract and experimental results: the headline claim that joint multi-task training of the tokenizer yields a discrete token space measurably better suited to TTS (producing the reported SOTA WER) is load-bearing yet unsupported by any ablation that trains an otherwise identical tokenizer on single objectives or independent stages and measures the resulting WER delta when the same LLM+FM are attached.
  2. [Abstract] Abstract: the reported WERs of 0.78% and 1.56% are presented without quantitative ablations isolating the contribution of each objective, without error bars, and without statistical tests comparing E2E versus cascaded baselines.
minor comments (2)
  1. The manuscript should clarify the precise training schedule, data splits, and hyperparameter selection procedure to allow assessment of whether evaluation metrics influenced model selection.
  2. Details on the architecture and training of the reward model (RM) are insufficient for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below and will revise the paper to incorporate additional experiments and analysis as needed.

  1. Referee: [Abstract / Experimental results] Abstract and experimental results: the headline claim that joint multi-task training of the tokenizer yields a discrete token space measurably better suited to TTS (producing the reported SOTA WER) is load-bearing yet unsupported by any ablation that trains an otherwise identical tokenizer on single objectives or independent stages and measures the resulting WER delta when the same LLM+FM are attached.

    Authors: We agree that a direct ablation isolating multi-task tokenizer training versus single-objective or independent-stage training would provide stronger support for the claim that the joint objectives produce a token space better suited to TTS. The current manuscript compares the full E2E system against cascaded baselines (independently trained components), which provides indirect evidence, but does not include the requested tokenizer-specific ablation with fixed LLM+FM. We will add these ablations in the revision, training otherwise identical tokenizers on single objectives and reporting the resulting WER deltas. revision: yes

  2. Referee: [Abstract] Abstract: the reported WERs of 0.78% and 1.56% are presented without quantitative ablations isolating the contribution of each objective, without error bars, and without statistical tests comparing E2E versus cascaded baselines.

    Authors: We acknowledge that the reported WER figures lack error bars, statistical tests, and per-objective contribution ablations. In the revised manuscript we will report results across multiple random seeds with error bars, include statistical significance tests (e.g., paired t-tests) against the cascaded baselines, and expand the ablation section to quantify the contribution of each training objective (reconstruction, next-token prediction, and recognition). revision: yes
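
The multi-seed comparison promised in this response can be sketched as a paired t statistic over per-seed WERs. The code below is a minimal stdlib version; the function name is hypothetical and the WER values are synthetic numbers invented purely to exercise the code, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Synthetic per-seed WER values (%), NOT taken from the paper.
cascaded_wer = [2.0, 2.5, 1.5, 2.0]
e2e_wer = [1.0, 1.0, 1.0, 1.0]
t_stat, dof = paired_t_statistic(cascaded_wer, e2e_wer)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

The statistic would then be compared against a Student's t distribution with `dof` degrees of freedom to obtain a p-value; SciPy's `scipy.stats.ttest_rel` performs the full test if that dependency is available.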

Circularity Check

0 steps flagged

No circularity: empirical training claims with no derivations or self-referential reductions


The paper describes an end-to-end training procedure for a TTS system using joint multi-task objectives on a tokenizer, followed by LLM optimization, and reports measured WER improvements on Seed-TTS-Eval. The reported gains do not rest on any derivation or first-principles result. The central claim (joint objectives yield a better token space) is an empirical assertion supported by benchmark numbers rather than any definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The result remains externally falsifiable via independent replication or ablation, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the described multi-task objectives. No explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.1-grok · 5787 in / 1339 out tokens · 19255 ms · 2026-06-27T15:20:20.182602+00:00 · methodology

