Qwen3-TTS Technical Report
Pith reviewed 2026-05-16 19:20 UTC · model grok-4.3
The pith
Qwen3-TTS achieves state-of-the-art multilingual text-to-speech with 3-second voice cloning and low-latency streaming.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen3-TTS reaches state-of-the-art performance in multilingual text-to-speech by training a dual-track LM architecture on more than 5 million hours of data spanning 10 languages, combined with a 25 Hz tokenizer for semantic integration and a 12 Hz tokenizer that enables 97 ms first-packet streaming latency.
What carries the argument
A dual-track LM architecture paired with Qwen-TTS-Tokenizer-25Hz, a single-codebook codec emphasizing semantic content and integration with Qwen-Audio, plus Qwen-TTS-Tokenizer-12Hz, which cuts bitrate aggressively and streams immediately through a lightweight causal ConvNet.
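For scale: at the 12 Hz tokenizer's stated 12.5 Hz frame rate, a single token frame spans 80 ms of audio, so the reported 97 ms first packet leaves only a thin compute margin. A back-of-envelope sketch (the one-frame-first-packet split is our assumption, not something the report states):

```python
# Back-of-envelope check of the 97 ms first-packet figure.
# Known from the report: 12.5 Hz token rate, 97 ms first-packet latency.
# Assumption (ours): the first packet carries one token frame of audio,
# with the remainder being model + vocoder compute.

token_rate_hz = 12.5
frame_span_ms = 1000.0 / token_rate_hz   # 80 ms of audio per token frame
first_packet_ms = 97.0                   # reported first-packet latency

compute_margin_ms = first_packet_ms - frame_span_ms
print(f"audio span per frame:   {frame_span_ms:.0f} ms")
print(f"implied compute margin: {compute_margin_ms:.0f} ms")
# -> 80 ms of audio plus ~17 ms of compute: the causal ConvNet has to
#    decode a frame well inside one frame interval to sustain streaming.
```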
If this is right
- Enables creation of entirely novel voices and fine-grained manipulation through description-based control.
- Supports streaming waveform reconstruction through the 25 Hz tokenizer's block-wise DiT (see the sketch after this list).
- Achieves immediate first-packet emission at 97 ms latency with the 12 Hz tokenizer design.
- Facilitates further research through open release of both tokenizers and models.
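For intuition, a toy sketch of what block-wise streaming reconstruction looks like operationally. The report's actual decoder on this path is a block-wise DiT; the block size, look-back window, and decode_block stub below are illustrative assumptions, not its interface:

```python
# Toy shape of block-wise streaming waveform reconstruction.
# Hypothetical parameters: BLOCK_TOKENS and LOOKBACK_TOKENS are our
# stand-ins; the report's decoder is a block-wise DiT over 25 Hz tokens.

from typing import Iterable, Iterator, List

BLOCK_TOKENS = 8       # ~320 ms of audio per block at 25 Hz (assumed)
LOOKBACK_TOKENS = 4    # left context carried across blocks (assumed)

def decode_block(tokens: List[int], context: List[int]) -> bytes:
    """Stand-in for the block decoder (a DiT in the report)."""
    return bytes(len(tokens))  # placeholder waveform chunk

def stream_waveform(token_stream: Iterable[int]) -> Iterator[bytes]:
    buf: List[int] = []
    context: List[int] = []
    for tok in token_stream:
        buf.append(tok)
        if len(buf) == BLOCK_TOKENS:
            yield decode_block(buf, context)  # audio is emitted per block
            context = (context + buf)[-LOOKBACK_TOKENS:]
            buf = []
    if buf:                                   # flush the final partial block
        yield decode_block(buf, context)

# Each yielded chunk can be played immediately, which is what makes the
# 25 Hz path streamable even with a non-causal decoder inside each block.
for chunk in stream_waveform(range(20)):
    print(len(chunk), "placeholder samples")
```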
Where Pith is reading between the lines
- The low-latency tokenizer could support deployment in conversational systems requiring immediate response.
- Description-based control may reduce reliance on reference audio for custom voice creation.
- Scaling the dual-track approach to additional languages would depend on acquiring comparable volumes of clean training data.
Load-bearing premise
The selected benchmarks and subjective evaluations accurately represent performance in diverse real-world multilingual conditions without systematic issues in the training data.
What would settle it
A controlled test on an eleventh language, or on noisy real-world audio, in which the models fall below current leading baselines on objective metrics would disprove the state-of-the-art claim.
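Operationally, that test reduces to a paired objective comparison on held-out audio. A minimal sketch, assuming both systems' outputs pass through the same ASR front end; the transcript triples below are hypothetical placeholders, not real data:

```python
# Shape of the falsification test: paired CER comparison on a held-out
# language. The (reference, ASR-of-model, ASR-of-baseline) triples are
# fabricated for illustration.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used to compute character error rate."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

pairs = [  # hypothetical eleventh-language utterances
    ("kumusta ka", "kumusta ka", "kumusta kah"),
    ("magandang umaga", "magandang umaga", "magandan umaga"),
]
cer_model = sum(cer(r, m) for r, m, _ in pairs) / len(pairs)
cer_baseline = sum(cer(r, b) for r, _, b in pairs) / len(pairs)
print(f"model CER {cer_model:.3f} vs baseline CER {cer_baseline:.3f}")
# The SOTA claim fails on this axis iff the model's CER exceeds the
# leading baseline's on such held-out data.
```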
Original abstract
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamlessly integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmark (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
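For scale, the "extreme bitrate reduction" claim can be sanity-checked from the stated geometry: 12.5 Hz frames with 16 codebooks each. The abstract does not give the per-codebook vocabulary size, so the sketch below assumes 1024-entry codebooks (10 bits per code) purely for illustration:

```python
# Bitrate implied by the 12 Hz tokenizer's stated geometry.
# Known from the abstract: 12.5 Hz frame rate, 16 codebooks per frame.
# Assumed (not in the abstract): 1024-entry codebooks, i.e. 10 bits each.

import math

frame_rate_hz = 12.5
num_codebooks = 16
codebook_size = 1024                       # illustrative assumption
bits_per_code = math.log2(codebook_size)   # 10 bits

bitrate_bps = frame_rate_hz * num_codebooks * bits_per_code
print(f"{bitrate_bps / 1000:.1f} kbps")    # -> 2.0 kbps under this assumption
# For reference, 24 kHz 16-bit PCM is 384 kbps, i.e. roughly a 190x reduction.
```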
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This technical report presents the Qwen3-TTS family of multilingual TTS models. Trained on more than 5 million hours of speech data across 10 languages, the models employ a dual-track LM architecture with two tokenizers: Qwen-TTS-Tokenizer-25Hz for semantic content and Qwen-TTS-Tokenizer-12Hz for low-latency streaming. The work claims state-of-the-art results on benchmarks such as the TTS multilingual test set, InstructTTSEval, and an internal long speech test set, while supporting voice cloning and controllable synthesis. The tokenizers and models are released under Apache 2.0.
Significance. Should the performance claims be validated through detailed comparisons, the contribution would be notable for advancing open multilingual TTS with streaming capabilities and integration potential with audio models. The scale of training data and the dual tokenizer approach could influence future work in controllable and real-time speech synthesis.
major comments (2)
- [Abstract] Abstract: The assertion of state-of-the-art performance lacks any supporting quantitative data, such as specific metric values, comparisons to named baseline systems, or statistical significance measures. This omission prevents independent verification of the central claim. (A sketch of such a significance test follows this list.)
- [Abstract] Abstract (long speech test set): The reliance on 'our long speech test set' for SOTA claims introduces a risk of selection bias, as no information is provided on its construction, independence from the 5M-hour training corpus, or comparison to external long-form datasets. This is load-bearing for the generalization of the reported results.
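The significance evidence asked for above could take the form of a paired bootstrap over per-utterance scores. A minimal sketch; the score arrays are fabricated placeholders, not the paper's data:

```python
# Paired bootstrap test of "system A beats system B" on a per-utterance
# metric (e.g. CER, lower is better). All numbers are placeholders.

import random

random.seed(0)
scores_a = [0.021, 0.018, 0.030, 0.025, 0.019, 0.022, 0.028, 0.017]
scores_b = [0.024, 0.020, 0.029, 0.031, 0.023, 0.026, 0.027, 0.021]

n, trials, wins = len(scores_a), 10_000, 0
for _ in range(trials):
    idx = [random.randrange(n) for _ in range(n)]  # resample utterances
    if sum(scores_a[i] for i in idx) < sum(scores_b[i] for i in idx):
        wins += 1
print(f"P(A better than B under resampling) = {wins / trials:.3f}")
# A value at or above ~0.95 would support reporting the gap as significant.
```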
minor comments (2)
- [Abstract] Grammatical error: 'diverse objective and subjective benchmark' should be 'benchmarks'.
- [Abstract] Typo: 'seamlessly integration' should be 'seamless integration'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract requires strengthening with quantitative support and additional details on the long speech test set. We will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] Abstract: The assertion of state-of-the-art performance lacks any supporting quantitative data, such as specific metric values, comparisons to named baseline systems, or statistical significance measures. This omission prevents independent verification of the central claim.
Authors: We agree that the abstract should include key quantitative results. In the revised version, we will add specific metric values (e.g., objective scores on the TTS multilingual test set and InstructTTSEval), named baseline comparisons, and references to statistical significance from the experimental results already detailed in the main body. revision: yes
-
Referee: [Abstract] Abstract (long speech test set): The reliance on 'our long speech test set' for SOTA claims introduces a risk of selection bias, as no information is provided on its construction, independence from the 5M-hour training corpus, or comparison to external long-form datasets. This is load-bearing for the generalization of the reported results.
Authors: We acknowledge the concern. The manuscript already describes the long speech test set in the experiments section, including its scale and separation from training data. We will add a concise summary of its construction, independence, and relation to external long-form datasets directly into the abstract to address potential bias. revision: yes
Circularity Check
Empirical training report with no derivation chain or self-referential reduction
Full rationale
The document is a technical report on model training and benchmarking rather than a mathematical derivation. No equations, ansatzes, or closed-form predictions are present that could reduce outputs to inputs by construction. The sole potential concern is the self-constructed 'long speech test set' referenced in the abstract for SOTA claims, but this is an evaluation detail rather than a load-bearing step in any derivation; it does not create circularity under the enumerated patterns. The work remains self-contained as an empirical contribution with external benchmarks also cited.
Forward citations
Cited by 18 Pith papers
-
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement ...
-
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
-
NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
-
Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark
CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.
-
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...
-
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
-
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
-
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...
-
RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.
-
JaiTTS: A Thai Voice Cloning Model
JaiTTS-v1.0 achieves a character error rate of 1.94% on short Thai speech tasks, surpassing human ground truth of 1.98%, matches humans on long tasks, and wins 283 of 400 human pairwise comparisons against commercial models.
-
EdgeFM: Efficient Edge Inference for Vision-Language Models
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
-
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
A system based on OmniVoice with multi-model ensemble distillation for fine-tuning shows consistent gains in intelligibility metrics while keeping speaker similarity for cross-lingual scientific speech.
Reference graph
Works this paper leans on
-
[1]
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-TTS: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024.
-
[2]
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. arXiv preprint arXiv:2410.06885, 2024.
-
[3]
High Fidelity Neural Audio Compression
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
-
[4]
Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024.
-
[5]
CosyVoice: A Scalable Multilingual Zero-Shot Text-to-Speech Synthesizer Based on Supervised Semantic Tokens
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024a. Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, C...
-
[6]
XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, and Xipeng Qiu. XY-Tokenizer: Mitigating the semantic-acoustic conflict in low-bitrate speech codecs. CoRR, abs/2506.23325, 2025.
-
[7]
StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding
Dake Guo, Jixun Yao, Linhan Ma, He Wang, and Lei Xie. StreamFlow: Streaming flow matching with block-wise guided attention mask for speech token decoding. CoRR, abs/2506.23986, 2025.
-
[8]
PromptTTS: Controllable Text-to-Speech with Text Descriptions
Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, and Xu Tan. PromptTTS: Controllable text-to-speech with text descriptions. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pp. 1–5. IEEE, 2023.
-
[9]
VoiceSculptor: Your Voice, Designed by You
Jingbin Hu, Huakang Chen, Linhan Ma, Dake Guo, Qirui Zhan, Wenhao Li, Haoyu Zhang, Kangxiang Xia, Ziyu Zhang, Wenjie Tian, et al. VoiceSculptor: Your voice, designed by you. arXiv preprint arXiv:2601.10629, 2026.
-
[10]
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems
Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, and Xipeng Qiu. InstructTTSEval: Benchmarking complex natural-language instruction following in text-to-speech systems. CoRR, abs/2506.16381, 2025.
-
[11]
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024.
-
[12]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
-
[13]
PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions
Guanghou Liu, Yongmao Zhang, Yi Lei, Yunlin Chen, Rui Wang, Lei Xie, and Zhifei Li. PromptStyle: Controllable style transfer for text-to-speech with natural language descriptions. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones (eds.), 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, A...
-
[14]
Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations
Dan Lyth and Simon King. Natural language guidance of high-fidelity text-to-speech with synthetic annotations. arXiv preprint arXiv:2402.01912, 2024.
-
[15]
VibeVoice Technical Report
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, and Furu Wei. VibeVoice technical report. CoRR, abs/2508.19205, 2025.
-
[16]
NaturalSpeech 2: Latent Diffusion Models Are Natural and Zero-Shot Speech and Singing Synthesizers
Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116, 2023.
-
[17]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
-
[18]
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, et al. Spark-TTS: An efficient LLM-based text-to-speech model with single-stream decoupled speech tokens. arXiv preprint arXiv:2503.01710, 2025.
-
[19]
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu. MaskGCT: Zero-shot text-to-speech with masked generative codec transformer. arXiv preprint arXiv:2409.00750, 2024.
-
[20]
KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction
Kangxiang Xia, Xinfa Zhu, Jixun Yao, Wenjie Tian, Wenhao Li, and Lei Xie. KALL-E: Autoregressive speech synthesis with next-distribution prediction. CoRR, abs/2412.16846, 2024.
-
[21]
FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, and Yao Hu. FireRedTTS-2: Towards long conversational speech generation for podcast and chatbot. CoRR, abs/2509.02020, 2025.
-
[22]
Qwen2.5-Omni Technical Report
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025.
-
[23]
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, et al. UniAudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704, 2023.
-
[24]
Codec does matter: Exploring the semantic shortcoming of codec for audio language model
Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, and Wei Xue. Codec does matter: Exploring the semantic shortcoming of codec for audio language model. In AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphi...
-
[25]
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, et al. MiniMax-Speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder. arXiv preprint arXiv:2505.07916, 2025a. Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji...
-
[26]
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Ye, Weiyue Sun, Jiancheng Gui, Kehan Li, Zhiyong Wu, and Zhiyuan Liu. VoxCPM: Tokenizer-free TTS for context-aware speech generation and true-to-life voice cloning. CoRR, abs/2509.24650, 2025.