End-to-End Training for Discrete Token LLM based TTS System

Changfeng Gao; Jun Yuan; ShiDong Shang; Ye Bai; Yong Ren; Zhao You

arxiv: 2606.09234 · v1 · pith:IJCIFOTYnew · submitted 2026-06-08 · 💻 cs.SD · cs.AI

End-to-End Training for Discrete Token LLM based TTS System

Changfeng Gao , Yong Ren , Jun Yuan , Ye Bai , Zhao You , ShiDong Shang This is my paper

Pith reviewed 2026-06-27 15:20 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords end-to-end trainingtext-to-speechdiscrete tokenslarge language modelflow matchingspeech tokenizerword error rateautoregressive model

0 comments

The pith

Jointly optimizing the speech tokenizer, LLM, flow-matching model and reward model produces discrete tokens better suited to text-to-speech than separately trained components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a cascaded TTS pipeline of tokenizer, autoregressive LLM and flow-matching decoder can be trained as a single system instead of in separate stages. Multi-task objectives on the tokenizer combine reconstruction, next-token prediction and recognition signals so the resulting discrete tokens carry both acoustic and semantic detail useful for synthesis. The LLM is then further tuned with feedback from the decoder and reward model to reduce mismatch at generation time. On the Seed-TTS-Eval benchmark the resulting system reaches word error rates of 0.78 percent and 1.56 percent using a 0.6 B LLM and 0.5 B flow-matching model, beating prior cascaded baselines. The unified training removes the need for independent optimization of each stage.

Core claim

Jointly optimizing the speech tokenizer using reconstruction for the flow-matching model, next-token prediction for the LLM and multi-recognition for the reward model creates a discrete token space better tailored to TTS; further optimizing the LLM with downstream reconstruction and recognition objectives then reduces inference mismatch and steers generations toward preferred outputs, yielding consistent gains over independently trained cascaded pipelines.

What carries the argument

The end-to-end optimization framework that applies multi-task objectives to the tokenizer and downstream reconstruction and recognition objectives to the LLM.

If this is right

The unified system consistently outperforms cascaded baselines on TTS tasks.
New state-of-the-art word error rates of 0.78 percent and 1.56 percent are reached on Seed-TTS-Eval.
The gains hold with a 0.6 B-parameter LLM and 0.5 B-parameter flow-matching model.
Training is simplified because the components no longer require separate optimization stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint-objective approach could be tested on other discrete-token pipelines such as music or sound-effect generation.
Reducing decoder-LLM mismatch through downstream feedback may apply to any autoregressive generation system that uses a separate decoder.
The resulting token space might also improve performance on downstream speech recognition or understanding tasks.

Load-bearing premise

The joint multi-task objectives on the tokenizer produce a discrete token space measurably better suited to TTS than tokens trained independently.

What would settle it

An ablation that trains the tokenizer independently on the same data and measures whether word error rate on Seed-TTS-Eval rises back to the level of the cascaded baseline.

Figures

Figures reproduced from arXiv: 2606.09234 by Changfeng Gao, Jun Yuan, ShiDong Shang, Ye Bai, Yong Ren, Zhao You.

**Figure 2.** Figure 2: Three stage training for the E2E TTS. Specifically, we replace the ground-truth tokens q1:T with either the Gumbel–Softmax hard samples or the hidden states of the LLM, denoted as h LM 1:T : q ′ t = CodeBook( GumbelSoftmax(p(ct | c1:t−1))) or h LM t . (10) Accordingly, the reconstruction and flow-matching losses computed using LLM-predicted tokens are defined as LLRM and LLFM, respectively. The combination… view at source ↗

**Figure 3.** Figure 3: Distributions for speech token trained with different loss functions on the LibriTTS [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

read the original abstract

Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for FM, next-token prediction for LLM, and multi recognition task for RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by FM and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves a word error rate (WER) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a joint multi-task recipe for training the tokenizer and LLM in discrete-token TTS and reports new low WER numbers, but the central claim about improved token space lacks isolating ablations.

read the letter

The main takeaway is that this work adds a reward model and runs multi-task objectives on the tokenizer (reconstruction, next-token prediction, multi-recognition) before doing downstream LLM fine-tuning, then shows lower WER than cascaded baselines on Seed-TTS-Eval.

What is new is the specific combination of those joint objectives plus the LLM optimization step to cut inference mismatch. The pipeline structure itself follows prior discrete-token TTS work, so the contribution sits in the training schedule rather than new components.

The results look practically useful: 0.78% and 1.56% WER with 0.6B LLM and 0.5B FM models is a clear step up from the cited baselines. The paper does a reasonable job describing how the multi-task setup is meant to steer tokens toward acoustically and semantically useful representations.

The soft spot is exactly the one flagged in the stress test. The abstract asserts that the joint objectives produce a token space better suited to TTS, yet supplies no control that trains an otherwise identical tokenizer on single objectives and measures the downstream WER difference. Without that isolation, the gains could trace to the reward model, longer training, or other unmentioned factors. No error bars or significance tests are mentioned either.

This is incremental empirical work aimed at people already building discrete-token TTS systems. It is coherent on its own terms and engages the relevant literature, so it deserves a serious referee even though the evidence for the headline mechanism is incomplete. I would send it to review and expect the main questions to focus on ablations and statistical support for the numbers.

Referee Report

2 major / 2 minor

Summary. The paper proposes a fully end-to-end optimization framework for discrete-token LLM-based TTS that jointly trains the speech tokenizer on multi-task objectives (reconstruction for the flow-matching model, next-token prediction for the LLM, and multi-recognition for an added reward model), then further optimizes the LLM using downstream reconstruction and recognition signals. This is claimed to produce tokens better tailored to TTS and reduce inference-time mismatch, yielding consistent outperformance of cascaded baselines and new SOTA WERs of 0.78% and 1.56% on Seed-TTS-Eval using a 0.6B LLM and 0.5B FM model.

Significance. If the central empirical claims hold after proper controls, the work would demonstrate that holistic joint training can improve discrete-token TTS performance while simplifying the pipeline relative to independent component training, providing evidence that multi-task tokenization and downstream LLM steering are effective for reducing mismatch in cascaded systems.

major comments (2)

[Abstract / Experimental results] Abstract and experimental results: the headline claim that joint multi-task training of the tokenizer yields a discrete token space measurably better suited to TTS (producing the reported SOTA WER) is load-bearing yet unsupported by any ablation that trains an otherwise identical tokenizer on single objectives or independent stages and measures the resulting WER delta when the same LLM+FM are attached.
[Abstract] Abstract: the reported WER improvements of 0.78% and 1.56% are presented without quantitative ablations isolating the contribution of each objective, without error bars, and without statistical tests comparing E2E versus cascaded baselines.

minor comments (2)

The manuscript should clarify the precise training schedule, data splits, and hyperparameter selection procedure to allow assessment of whether evaluation metrics influenced model selection.
Details on the architecture and training of the reward model (RM) are insufficient for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below and will revise the paper to incorporate additional experiments and analysis as needed.

read point-by-point responses

Referee: [Abstract / Experimental results] Abstract and experimental results: the headline claim that joint multi-task training of the tokenizer yields a discrete token space measurably better suited to TTS (producing the reported SOTA WER) is load-bearing yet unsupported by any ablation that trains an otherwise identical tokenizer on single objectives or independent stages and measures the resulting WER delta when the same LLM+FM are attached.

Authors: We agree that a direct ablation isolating multi-task tokenizer training versus single-objective or independent-stage training would provide stronger support for the claim that the joint objectives produce a token space better suited to TTS. The current manuscript compares the full E2E system against cascaded baselines (independently trained components), which provides indirect evidence, but does not include the requested tokenizer-specific ablation with fixed LLM+FM. We will add these ablations in the revision, training otherwise identical tokenizers on single objectives and reporting the resulting WER deltas. revision: yes
Referee: [Abstract] Abstract: the reported WER improvements of 0.78% and 1.56% are presented without quantitative ablations isolating the contribution of each objective, without error bars, and without statistical tests comparing E2E versus cascaded baselines.

Authors: We acknowledge that the reported WER figures lack error bars, statistical tests, and per-objective contribution ablations. In the revised manuscript we will report results across multiple random seeds with error bars, include statistical significance tests (e.g., paired t-tests) against the cascaded baselines, and expand the ablation section to quantify the contribution of each training objective (reconstruction, next-token prediction, and recognition). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training claims with no derivations or self-referential reductions

full rationale

The paper describes an end-to-end training procedure for a TTS system using joint multi-task objectives on a tokenizer, followed by LLM optimization, and reports measured WER improvements on Seed-TTS-Eval. No equations, derivations, or first-principles results are present. The central claim (joint objectives yield a better token space) is an empirical assertion supported by benchmark numbers rather than any definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The result remains externally falsifiable via independent replication or ablation, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the described multi-task objectives. No explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.1-grok · 5787 in / 1339 out tokens · 19255 ms · 2026-06-27T15:20:20.182602+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 17 canonical work pages · 8 internal anchors

[1]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, J...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, et al. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training. arXiv preprint arXiv:2505.17589, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report. arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Asq: An ultra-low bit rate asr-oriented speech quantization method

Lingxuan Ye, Changfeng Gao, Gaofeng Cheng, Liuping Luo, and Qingwei Zhao. Asq: An ultra-low bit rate asr-oriented speech quantization method. IEEE Signal Processing Letters, 31:221–225, 2023

2023
[5]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. CoRR, abs/2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Fireredtts: A foundation text-to- speech framework for industry-level generative speech applica- tions,

Haohan Guo, Kun Liu, Feiyu Shen, Yi-Chen Wu, Feng-Long Xie, Kun Xie, and Kaituo Xu. Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications. CoRR, abs/2409.03283, 2024

work page arXiv 2024
[7]

Joyvoice: Long-context condition- ing for anthropomorphic multi-speaker conversational synthesis

Fan Yu, Tao Wang, You Wu, Lin Zhu, Wei Deng, Weisheng Han, Wenchao Wang, Lin Hu, Xiangyu Liang, Xiaodong He, et al. Joyvoice: Long-context condition- ing for anthropomorphic multi-speaker conversational synthesis. arXiv preprint arXiv:2512.19090, 2025

work page arXiv 2025
[8]

Differentiable reward optimization for llm based tts system

Changfeng Gao, Zhihao Du, and Shiliang Zhang. Differentiable reward optimization for llm based tts system. arXiv preprint arXiv:2507.05911, 2025

work page arXiv 2025
[9]

Emo-codec: An in-depth look at emotion preservation capacity of legacy and neural codec models with subjective and objective evaluations

Wenze Ren, Yi-Cheng Lin, Huang-Cheng Chou, Haibin Wu, Yi-Chiao Wu, Chi-Chun Lee, Hung-yi Lee, Hsin-Min Wang, and Yu Tsao. Emo-codec: An in-depth look at emotion preservation capacity of legacy and neural codec models with subjective and objective evaluations. In 2024 Asia Pacific Signal and Information Processing Associa- tion Annual Summit and Conference...

2024
[10]

Codec2vec: Self-supervised speech representa- tion learning using neural speech codecs

Wei-Cheng Tseng and David Harwath. Codec2vec: Self-supervised speech representa- tion learning using neural speech codecs. arXiv preprint arXiv:2511.16639, 2025

work page arXiv 2025
[11]

Tadicodec: Text-aware diffusion speech tokenizer for speech language modeling

Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, and Zhizheng Wu. Tadicodec: Text-aware diffusion speech tokenizer for speech language modeling. Advances in Neural Information Processing Systems, 38:147494–147523, 2026

2026
[12]

Scaling speech tokenizers with diffusion autoencoders

Yuancheng Wang, Zhenyu Tang, Yun Wang, Arthur Hinsvark, Yingru Liu, Yinghao Li, Kainan Peng, Junyi Ao, Mingbo Ma, Mike Seltzer, et al. Scaling speech tokenizers with diffusion autoencoders. arXiv preprint arXiv:2602.06602, 2026

work page arXiv 2026
[13]

Finite scalar quantization: VQ-VAE made simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-VAE made simple. In ICLR. OpenReview.net, 2024. 10

2024
[14]

Maskgct: Zero-shot text-to- speech with masked generative codec transformer,

Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Shunsi Zhang, and Zhizheng Wu. Maskgct: Zero-shot text- to-speech with masked generative codec transformer. CoRR, abs/2409.00750, 2024

work page arXiv 2024
[15]

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. CoRR, abs/2410.06885, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

F5r- tts: Improving flow matching based text-to-speech with group relative policy optimiza- tion

Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, and Baoxun Wang. F5r- tts: Improving flow matching based text-to-speech with group relative policy optimiza- tion. arXiv preprint arXiv:2504.02407, 2025

work page arXiv 2025
[17]

Zipvoice: Fast and high-quality zero-shot text- to-speech with flow matching

Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, and Daniel Povey. Zipvoice: Fast and high-quality zero-shot text- to-speech with flow matching. arXiv preprint arXiv:2506.13053, 2025

work page arXiv 2025
[18]

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, and Daniel Povey. Omnivoice: Towards omnilingual zero- shot text-to-speech with diffusion language models. arXiv preprint arXiv:2604.00688, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, et al. Spark-tts: An eﬀicient llm- based text-to-speech model with single-stream decoupled speech tokens. arXiv preprint arXiv:2503.01710, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Faster whisper large v3, 2023

Systran. Faster whisper large v3, 2023

2023
[22]

Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms

Tongyi Speech Team. Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms. arxiv, 2024

2024
[23]

Tera: Self-supervised learning of trans- former encoder representation for speech

Andy T Liu, Shang-Wen Li, and Hung-yi Lee. Tera: Self-supervised learning of trans- former encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2351–2366, 2021

2021
[24]

vq-wav2vec: Self-supervised learn- ing of discrete speech representations

Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learn- ing of discrete speech representations. arXiv preprint arXiv:1910.05453, 2019. 11

work page arXiv 1910

[1] [1]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, J...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, et al. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training. arXiv preprint arXiv:2505.17589, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report. arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Asq: An ultra-low bit rate asr-oriented speech quantization method

Lingxuan Ye, Changfeng Gao, Gaofeng Cheng, Liuping Luo, and Qingwei Zhao. Asq: An ultra-low bit rate asr-oriented speech quantization method. IEEE Signal Processing Letters, 31:221–225, 2023

2023

[5] [5]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. CoRR, abs/2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Fireredtts: A foundation text-to- speech framework for industry-level generative speech applica- tions,

Haohan Guo, Kun Liu, Feiyu Shen, Yi-Chen Wu, Feng-Long Xie, Kun Xie, and Kaituo Xu. Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications. CoRR, abs/2409.03283, 2024

work page arXiv 2024

[7] [7]

Joyvoice: Long-context condition- ing for anthropomorphic multi-speaker conversational synthesis

Fan Yu, Tao Wang, You Wu, Lin Zhu, Wei Deng, Weisheng Han, Wenchao Wang, Lin Hu, Xiangyu Liang, Xiaodong He, et al. Joyvoice: Long-context condition- ing for anthropomorphic multi-speaker conversational synthesis. arXiv preprint arXiv:2512.19090, 2025

work page arXiv 2025

[8] [8]

Differentiable reward optimization for llm based tts system

Changfeng Gao, Zhihao Du, and Shiliang Zhang. Differentiable reward optimization for llm based tts system. arXiv preprint arXiv:2507.05911, 2025

work page arXiv 2025

[9] [9]

Emo-codec: An in-depth look at emotion preservation capacity of legacy and neural codec models with subjective and objective evaluations

Wenze Ren, Yi-Cheng Lin, Huang-Cheng Chou, Haibin Wu, Yi-Chiao Wu, Chi-Chun Lee, Hung-yi Lee, Hsin-Min Wang, and Yu Tsao. Emo-codec: An in-depth look at emotion preservation capacity of legacy and neural codec models with subjective and objective evaluations. In 2024 Asia Pacific Signal and Information Processing Associa- tion Annual Summit and Conference...

2024

[10] [10]

Codec2vec: Self-supervised speech representa- tion learning using neural speech codecs

Wei-Cheng Tseng and David Harwath. Codec2vec: Self-supervised speech representa- tion learning using neural speech codecs. arXiv preprint arXiv:2511.16639, 2025

work page arXiv 2025

[11] [11]

Tadicodec: Text-aware diffusion speech tokenizer for speech language modeling

Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, and Zhizheng Wu. Tadicodec: Text-aware diffusion speech tokenizer for speech language modeling. Advances in Neural Information Processing Systems, 38:147494–147523, 2026

2026

[12] [12]

Scaling speech tokenizers with diffusion autoencoders

Yuancheng Wang, Zhenyu Tang, Yun Wang, Arthur Hinsvark, Yingru Liu, Yinghao Li, Kainan Peng, Junyi Ao, Mingbo Ma, Mike Seltzer, et al. Scaling speech tokenizers with diffusion autoencoders. arXiv preprint arXiv:2602.06602, 2026

work page arXiv 2026

[13] [13]

Finite scalar quantization: VQ-VAE made simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-VAE made simple. In ICLR. OpenReview.net, 2024. 10

2024

[14] [14]

Maskgct: Zero-shot text-to- speech with masked generative codec transformer,

Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Shunsi Zhang, and Zhizheng Wu. Maskgct: Zero-shot text- to-speech with masked generative codec transformer. CoRR, abs/2409.00750, 2024

work page arXiv 2024

[15] [15]

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. CoRR, abs/2410.06885, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

F5r- tts: Improving flow matching based text-to-speech with group relative policy optimiza- tion

Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, and Baoxun Wang. F5r- tts: Improving flow matching based text-to-speech with group relative policy optimiza- tion. arXiv preprint arXiv:2504.02407, 2025

work page arXiv 2025

[17] [17]

Zipvoice: Fast and high-quality zero-shot text- to-speech with flow matching

Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, and Daniel Povey. Zipvoice: Fast and high-quality zero-shot text- to-speech with flow matching. arXiv preprint arXiv:2506.13053, 2025

work page arXiv 2025

[18] [18]

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, and Daniel Povey. Omnivoice: Towards omnilingual zero- shot text-to-speech with diffusion language models. arXiv preprint arXiv:2604.00688, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, et al. Spark-tts: An eﬀicient llm- based text-to-speech model with single-stream decoupled speech tokens. arXiv preprint arXiv:2503.01710, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Faster whisper large v3, 2023

Systran. Faster whisper large v3, 2023

2023

[22] [22]

Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms

Tongyi Speech Team. Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms. arxiv, 2024

2024

[23] [23]

Tera: Self-supervised learning of trans- former encoder representation for speech

Andy T Liu, Shang-Wen Li, and Hung-yi Lee. Tera: Self-supervised learning of trans- former encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2351–2366, 2021

2021

[24] [24]

vq-wav2vec: Self-supervised learn- ing of discrete speech representations

Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learn- ing of discrete speech representations. arXiv preprint arXiv:1910.05453, 2019. 11

work page arXiv 1910