Qwen3-TTS Technical Report
Pith reviewed 2026-05-16 19:20 UTC · model grok-4.3
The pith
Qwen3-TTS achieves state-of-the-art multilingual text-to-speech with 3-second voice cloning and low-latency streaming.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen3-TTS reaches state-of-the-art performance in multilingual text-to-speech by training a dual-track LM architecture on more than 5 million hours of data spanning 10 languages, combined with a 25 Hz tokenizer for semantic integration and a 12 Hz tokenizer that enables 97 ms first-packet streaming latency.
What carries the argument
A dual-track LM architecture paired with Qwen-TTS-Tokenizer-25Hz, a single-codebook codec emphasizing semantic content and integration with Qwen-Audio, plus Qwen-TTS-Tokenizer-12Hz, which cuts bitrate aggressively and streams immediately through a lightweight causal ConvNet.
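For scale: at the 12 Hz tokenizer's stated 12.5 Hz frame rate, a single token frame spans 80 ms of audio, so the reported 97 ms first packet leaves only a thin compute margin. A back-of-envelope sketch (the one-frame-first-packet split is our assumption, not something the report states):

```python
# Back-of-envelope check of the 97 ms first-packet figure.
# Known from the report: 12.5 Hz token rate, 97 ms first-packet latency.
# Assumption (ours): the first packet carries one token frame of audio,
# with the remainder being model + vocoder compute.

token_rate_hz = 12.5
frame_span_ms = 1000.0 / token_rate_hz   # 80 ms of audio per token frame
first_packet_ms = 97.0                   # reported first-packet latency

compute_margin_ms = first_packet_ms - frame_span_ms
print(f"audio span per frame:   {frame_span_ms:.0f} ms")
print(f"implied compute margin: {compute_margin_ms:.0f} ms")
# -> 80 ms of audio plus ~17 ms of compute: the causal ConvNet has to
#    decode a frame well inside one frame interval to sustain streaming.
```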
If this is right
- Enables creation of entirely novel voices and fine-grained manipulation through description-based control.
- Supports streaming waveform reconstruction through the 25 Hz tokenizer's block-wise DiT (see the sketch after this list).
- Achieves immediate first-packet emission at 97 ms latency with the 12 Hz tokenizer design.
- Facilitates further research through open release of both tokenizers and models.
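For intuition, a toy sketch of what block-wise streaming reconstruction looks like operationally. The report's actual decoder on this path is a block-wise DiT; the block size, look-back window, and decode_block stub below are illustrative assumptions, not its interface:

```python
# Toy shape of block-wise streaming waveform reconstruction.
# Hypothetical parameters: BLOCK_TOKENS and LOOKBACK_TOKENS are our
# stand-ins; the report's decoder is a block-wise DiT over 25 Hz tokens.

from typing import Iterable, Iterator, List

BLOCK_TOKENS = 8       # ~320 ms of audio per block at 25 Hz (assumed)
LOOKBACK_TOKENS = 4    # left context carried across blocks (assumed)

def decode_block(tokens: List[int], context: List[int]) -> bytes:
    """Stand-in for the block decoder (a DiT in the report)."""
    return bytes(len(tokens))  # placeholder waveform chunk

def stream_waveform(token_stream: Iterable[int]) -> Iterator[bytes]:
    buf: List[int] = []
    context: List[int] = []
    for tok in token_stream:
        buf.append(tok)
        if len(buf) == BLOCK_TOKENS:
            yield decode_block(buf, context)  # audio is emitted per block
            context = (context + buf)[-LOOKBACK_TOKENS:]
            buf = []
    if buf:                                   # flush the final partial block
        yield decode_block(buf, context)

# Each yielded chunk can be played immediately, which is what makes the
# 25 Hz path streamable even with a non-causal decoder inside each block.
for chunk in stream_waveform(range(20)):
    print(len(chunk), "placeholder samples")
```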
Where Pith is reading between the lines
- The low-latency tokenizer could support deployment in conversational systems requiring immediate response.
- Description-based control may reduce reliance on reference audio for custom voice creation.
- Scaling the dual-track approach to additional languages would depend on acquiring comparable volumes of clean training data.
Load-bearing premise
The selected benchmarks and subjective evaluations accurately represent performance in diverse real-world multilingual conditions without systematic issues in the training data.
What would settle it
A controlled test on an eleventh language, or on noisy real-world audio, in which the models fall below current leading baselines on objective metrics would disprove the state-of-the-art claim.
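Operationally, that test reduces to a paired objective comparison on held-out audio. A minimal sketch, assuming both systems' outputs pass through the same ASR front end; the transcript triples below are hypothetical placeholders, not real data:

```python
# Shape of the falsification test: paired CER comparison on a held-out
# language. The (reference, ASR-of-model, ASR-of-baseline) triples are
# fabricated for illustration.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used to compute character error rate."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

pairs = [  # hypothetical eleventh-language utterances
    ("kumusta ka", "kumusta ka", "kumusta kah"),
    ("magandang umaga", "magandang umaga", "magandan umaga"),
]
cer_model = sum(cer(r, m) for r, m, _ in pairs) / len(pairs)
cer_baseline = sum(cer(r, b) for r, _, b in pairs) / len(pairs)
print(f"model CER {cer_model:.3f} vs baseline CER {cer_baseline:.3f}")
# The SOTA claim fails on this axis iff the model's CER exceeds the
# leading baseline's on such held-out data.
```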
Original abstract
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamlessly integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmark (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
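For scale, the "extreme bitrate reduction" claim can be sanity-checked from the stated geometry: 12.5 Hz frames with 16 codebooks each. The abstract does not give the per-codebook vocabulary size, so the sketch below assumes 1024-entry codebooks (10 bits per code) purely for illustration:

```python
# Bitrate implied by the 12 Hz tokenizer's stated geometry.
# Known from the abstract: 12.5 Hz frame rate, 16 codebooks per frame.
# Assumed (not in the abstract): 1024-entry codebooks, i.e. 10 bits each.

import math

frame_rate_hz = 12.5
num_codebooks = 16
codebook_size = 1024                       # illustrative assumption
bits_per_code = math.log2(codebook_size)   # 10 bits

bitrate_bps = frame_rate_hz * num_codebooks * bits_per_code
print(f"{bitrate_bps / 1000:.1f} kbps")    # -> 2.0 kbps under this assumption
# For reference, 24 kHz 16-bit PCM is 384 kbps, i.e. roughly a 190x reduction.
```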
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This technical report presents the Qwen3-TTS family of multilingual TTS models. Trained on more than 5 million hours of speech data across 10 languages, the models employ a dual-track LM architecture with two tokenizers: Qwen-TTS-Tokenizer-25Hz for semantic content and Qwen-TTS-Tokenizer-12Hz for low-latency streaming. The work claims state-of-the-art results on benchmarks such as the TTS multilingual test set, InstructTTSEval, and an internal long speech test set, while supporting voice cloning and controllable synthesis. The tokenizers and models are released under Apache 2.0.
Significance. Should the performance claims be validated through detailed comparisons, the contribution would be notable for advancing open multilingual TTS with streaming capabilities and integration potential with audio models. The scale of training data and the dual tokenizer approach could influence future work in controllable and real-time speech synthesis.
major comments (2)
- [Abstract] Abstract: The assertion of state-of-the-art performance lacks any supporting quantitative data, such as specific metric values, comparisons to named baseline systems, or statistical significance measures. This omission prevents independent verification of the central claim. (A sketch of such a significance test follows this list.)
- [Abstract] Abstract (long speech test set): The reliance on 'our long speech test set' for SOTA claims introduces a risk of selection bias, as no information is provided on its construction, independence from the 5M-hour training corpus, or comparison to external long-form datasets. This is load-bearing for the generalization of the reported results.
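The significance evidence asked for above could take the form of a paired bootstrap over per-utterance scores. A minimal sketch; the score arrays are fabricated placeholders, not the paper's data:

```python
# Paired bootstrap test of "system A beats system B" on a per-utterance
# metric (e.g. CER, lower is better). All numbers are placeholders.

import random

random.seed(0)
scores_a = [0.021, 0.018, 0.030, 0.025, 0.019, 0.022, 0.028, 0.017]
scores_b = [0.024, 0.020, 0.029, 0.031, 0.023, 0.026, 0.027, 0.021]

n, trials, wins = len(scores_a), 10_000, 0
for _ in range(trials):
    idx = [random.randrange(n) for _ in range(n)]  # resample utterances
    if sum(scores_a[i] for i in idx) < sum(scores_b[i] for i in idx):
        wins += 1
print(f"P(A better than B under resampling) = {wins / trials:.3f}")
# A value at or above ~0.95 would support reporting the gap as significant.
```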
minor comments (2)
- [Abstract] Grammatical error: 'diverse objective and subjective benchmark' should be 'benchmarks'.
- [Abstract] Typo: 'seamlessly integration' should be 'seamless integration'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract requires strengthening with quantitative support and additional details on the long speech test set. We will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] Abstract: The assertion of state-of-the-art performance lacks any supporting quantitative data, such as specific metric values, comparisons to named baseline systems, or statistical significance measures. This omission prevents independent verification of the central claim.
Authors: We agree that the abstract should include key quantitative results. In the revised version, we will add specific metric values (e.g., objective scores on the TTS multilingual test set and InstructTTSEval), named baseline comparisons, and references to statistical significance from the experimental results already detailed in the main body. revision: yes
-
Referee: [Abstract] Abstract (long speech test set): The reliance on 'our long speech test set' for SOTA claims introduces a risk of selection bias, as no information is provided on its construction, independence from the 5M-hour training corpus, or comparison to external long-form datasets. This is load-bearing for the generalization of the reported results.
Authors: We acknowledge the concern. The manuscript already describes the long speech test set in the experiments section, including its scale and separation from training data. We will add a concise summary of its construction, independence, and relation to external long-form datasets directly into the abstract to address potential bias. revision: yes
Circularity Check
Empirical training report with no derivation chain or self-referential reduction
Full rationale
The document is a technical report on model training and benchmarking rather than a mathematical derivation. No equations, ansatzes, or closed-form predictions are present that could reduce outputs to inputs by construction. The sole potential concern is the self-constructed 'long speech test set' referenced in the abstract for SOTA claims, but this is an evaluation detail rather than a load-bearing step in any derivation; it does not create circularity under the enumerated patterns. The work remains self-contained as an empirical contribution with external benchmarks also cited.
Forward citations
Cited by 18 Pith papers
-
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement ...
-
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
-
NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
-
Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark
CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.
-
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...
-
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
-
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
-
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...
-
RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.
-
JaiTTS: A Thai Voice Cloning Model
JaiTTS-v1.0 achieves a character error rate of 1.94% on short Thai speech tasks, surpassing human ground truth of 1.98%, matches humans on long tasks, and wins 283 of 400 human pairwise comparisons against commercial models.
-
EdgeFM: Efficient Edge Inference for Vision-Language Models
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
-
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
A system based on OmniVoice with multi-model ensemble distillation for fine-tuning shows consistent gains in intelligibility metrics while keeping speaker similarity for cross-lingual scientific speech.
Reference graph
Works this paper leans on
-
[1]
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-TTS: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024.
-
[2]
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. arXiv preprint arXiv:2410.06885, 2024.
-
[3]
High Fidelity Neural Audio Compression
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
-
[4]
Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024.
-
[5]
CosyVoice: A Scalable Multilingual Zero-Shot Text-to-Speech Synthesizer Based on Supervised Semantic Tokens
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024a. Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, C...
-
[6]
XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, and Xipeng Qiu. XY-Tokenizer: Mitigating the semantic-acoustic conflict in low-bitrate speech codecs. CoRR, abs/2506.23325, 2025.
-
[7]
StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding
Dake Guo, Jixun Yao, Linhan Ma, He Wang, and Lei Xie. StreamFlow: Streaming flow matching with block-wise guided attention mask for speech token decoding. CoRR, abs/2506.23986, 2025.
-
[8]
PromptTTS: Controllable Text-to-Speech with Text Descriptions
Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, and Xu Tan. PromptTTS: Controllable text-to-speech with text descriptions. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pp. 1–5. IEEE, 2023.
-
[9]
VoiceSculptor: Your Voice, Designed by You
Jingbin Hu, Huakang Chen, Linhan Ma, Dake Guo, Qirui Zhan, Wenhao Li, Haoyu Zhang, Kangxiang Xia, Ziyu Zhang, Wenjie Tian, et al. VoiceSculptor: Your voice, designed by you. arXiv preprint arXiv:2601.10629, 2026.
-
[10]
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems
Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, and Xipeng Qiu. InstructTTSEval: Benchmarking complex natural-language instruction following in text-to-speech systems. CoRR, abs/2506.16381, 2025.
-
[11]
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024.
-
[12]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
-
[13]
PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions
Guanghou Liu, Yongmao Zhang, Yi Lei, Yunlin Chen, Rui Wang, Lei Xie, and Zhifei Li. PromptStyle: Controllable style transfer for text-to-speech with natural language descriptions. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones (eds.), 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, A...
-
[14]
Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations
Dan Lyth and Simon King. Natural language guidance of high-fidelity text-to-speech with synthetic annotations. arXiv preprint arXiv:2402.01912, 2024.
-
[15]
VibeVoice Technical Report
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, and Furu Wei. VibeVoice technical report. CoRR, abs/2508.19205, 2025.
-
[16]
NaturalSpeech 2: Latent Diffusion Models Are Natural and Zero-Shot Speech and Singing Synthesizers
Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116, 2023.
-
[17]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
-
[18]
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, et al. Spark-TTS: An efficient LLM-based text-to-speech model with single-stream decoupled speech tokens. arXiv preprint arXiv:2503.01710, 2025.
-
[19]
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu. MaskGCT: Zero-shot text-to-speech with masked generative codec transformer. arXiv preprint arXiv:2409.00750, 2024.
-
[20]
KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction
Kangxiang Xia, Xinfa Zhu, Jixun Yao, Wenjie Tian, Wenhao Li, and Lei Xie. KALL-E: Autoregressive speech synthesis with next-distribution prediction. CoRR, abs/2412.16846, 2024.
-
[21]
FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, and Yao Hu. FireRedTTS-2: Towards long conversational speech generation for podcast and chatbot. CoRR, abs/2509.02020, 2025.
-
[22]
Qwen2.5-Omni Technical Report
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025.
-
[23]
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, et al. UniAudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704, 2023.
-
[24]
Codec does matter: Exploring the semantic shortcoming of codec for audio language model
Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, and Wei Xue. Codec does matter: Exploring the semantic shortcoming of codec for audio language model. In AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphi...
-
[25]
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, et al. MiniMax-Speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder. arXiv preprint arXiv:2505.07916, 2025a. Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji...
-
[26]
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Ye, Weiyue Sun, Jiancheng Gui, Kehan Li, Zhiyong Wu, and Zhiyuan Liu. VoxCPM: Tokenizer-free TTS for context-aware speech generation and true-to-life voice cloning. CoRR, abs/2509.24650, 2025.