{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:MXBMD6Y22XYGBJPHRFNFY7KVTH","short_pith_number":"pith:MXBMD6Y2","schema_version":"1.0","canonical_sha256":"65c2c1fb1ad5f060a5e7895a5c7d5599cb7cd4c8c452f2ddd5afe4cc33bc702a","source":{"kind":"arxiv","id":"2601.15621","version":1},"attestation_state":"computed","paper":{"title":"Qwen3-TTS Technical Report","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Qwen3-TTS achieves state-of-the-art multilingual text-to-speech with 3-second voice cloning and low-latency streaming.","cross_cats":["cs.CL","eess.AS"],"primary_cat":"cs.SD","authors_text":"Baosong Yang, Bin Zhang, Dake Guo, Hangrui Hu, Hongkun Hao, Jingren Zhou, Jin Xu, Junyang Lin, Pei Zhang, Ting He, Xinfa Zhu, Xinyu Zhang, Xiong Wang, Zhifang Guo, Zishan Guo, Ziyue Jiang","submitted_at":"2026-01-22T03:51:43Z","abstract_excerpt":"In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2601.15621","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.SD","submitted_at":"2026-01-22T03:51:43Z","cross_cats_sorted":["cs.CL","eess.AS"],"title_canon_sha256":"d99480f1d84569a33dfe616970481a3a7b0dd54aba79c8dbaf4086a6d7aa9619","abstract_canon_sha256":"43e7a156c135547a462f7739d8e3537a2f0d147f2c0f63d766e748add5dabc39"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.851807Z","signature_b64":"EXOgjVqiX897bK5pvn9Wmg0JMvgV5D2HIGmFERKJcJnvGdlSndfzR2/ZSn43bUwnhcDugBFlwjUcDbPmsABdCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"65c2c1fb1ad5f060a5e7895a5c7d5599cb7cd4c8c452f2ddd5afe4cc33bc702a","last_reissued_at":"2026-05-17T23:38:46.851157Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.851157Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Qwen3-TTS Technical Report","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Qwen3-TTS achieves state-of-the-art multilingual text-to-speech with 3-second voice cloning and low-latency streaming.","cross_cats":["cs.CL","eess.AS"],"primary_cat":"cs.SD","authors_text":"Baosong Yang, Bin Zhang, Dake Guo, Hangrui Hu, Hongkun Hao, Jingren Zhou, Jin Xu, Junyang Lin, Pei Zhang, Ting He, Xinfa Zhu, Xinyu Zhang, Xiong Wang, Zhifang Guo, Zishan Guo, Ziyue Jiang","submitted_at":"2026-01-22T03:51:43Z","abstract_excerpt":"In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmark (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set).","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen benchmarks and subjective tests accurately reflect real-world multilingual use cases and that the 5 million hours of training data contain no systematic quality or bias issues that would degrade performance outside the reported evaluations.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Qwen3-TTS achieves state-of-the-art multilingual text-to-speech with 3-second voice cloning and low-latency streaming.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0398d329f77caba5dbf2a9c435a07904d3666dc4758b727d9aae850aeafb3b8e"},"source":{"id":"2601.15621","kind":"arxiv","version":1},"verdict":{"id":"5fe5248e-f534-4acd-8b3f-c3acc242d404","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T19:20:37.829335Z","strongest_claim":"Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmark (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set).","one_line_summary":"Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen benchmarks and subjective tests accurately reflect real-world multilingual use cases and that the 5 million hours of training data contain no systematic quality or bias issues that would degrade performance outside the reported evaluations.","pith_extraction_headline":"Qwen3-TTS achieves state-of-the-art multilingual text-to-speech with 3-second voice cloning and low-latency streaming."},"references":{"count":26,"sample":[{"doi":"","year":null,"title":"Seed-TTS: A Family of High-Quality Versatile Speech Generation Models","work_id":"6e88ee95-1133-4302-a142-cdf8f9456a8d","ref_index":1,"cited_arxiv_id":"2406.02430","is_internal_anchor":true},{"doi":"","year":null,"title":"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching","work_id":"2a9a61a0-5461-4302-8659-788b84ecca31","ref_index":2,"cited_arxiv_id":"2410.06885","is_internal_anchor":true},{"doi":"","year":null,"title":"High Fidelity Neural Audio Compression","work_id":"bc645d2d-e9f2-4cb8-9a6d-bd557bc7a258","ref_index":3,"cited_arxiv_id":"2210.13438","is_internal_anchor":true},{"doi":"","year":null,"title":"Moshi: a speech-text foundation model for real-time dialogue","work_id":"3104332b-d279-44c8-aaa7-3d5a13c01832","ref_index":4,"cited_arxiv_id":"2410.00037","is_internal_anchor":true},{"doi":"","year":null,"title":"CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens","work_id":"e5ad925a-4045-49b5-b301-208bcbf3eca8","ref_index":5,"cited_arxiv_id":"2407.05407","is_internal_anchor":true}],"resolved_work":26,"snapshot_sha256":"8e6a9c2a3da97a9f2de4cb91c89f10e18bf1f3768badff05cc640ec77e5e2e55","internal_anchors":8},"formal_canon":{"evidence_count":1,"snapshot_sha256":"5ee9999b78804e6981344a7be3ca4ede20245597730024c69ce0378a1b49234e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2601.15621","created_at":"2026-05-17T23:38:46.851252+00:00"},{"alias_kind":"arxiv_version","alias_value":"2601.15621v1","created_at":"2026-05-17T23:38:46.851252+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2601.15621","created_at":"2026-05-17T23:38:46.851252+00:00"},{"alias_kind":"pith_short_12","alias_value":"MXBMD6Y22XYG","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"MXBMD6Y22XYGBJPH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"MXBMD6Y2","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":21,"internal_anchor_count":21,"sample":[{"citing_arxiv_id":"2605.20830","citing_title":"Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09568","citing_title":"RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17488","citing_title":"Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.00688","citing_title":"OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27607","citing_title":"JaiTTS: A Thai Voice Cloning Model","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27476","citing_title":"EdgeFM: Efficient Edge Inference for Vision-Language Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26347","citing_title":"The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26136","citing_title":"One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09568","citing_title":"RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09386","citing_title":"Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27607","citing_title":"JaiTTS: A Thai Voice Cloning Model","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23586","citing_title":"Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22225","citing_title":"TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10580","citing_title":"Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11594","citing_title":"HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10065","citing_title":"ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08363","citing_title":"CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06765","citing_title":"VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing","ref_index":113,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22821","citing_title":"Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16211","citing_title":"NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17958","citing_title":"MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech","ref_index":30,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH","json":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH.json","graph_json":"https://pith.science/api/pith-number/MXBMD6Y22XYGBJPHRFNFY7KVTH/graph.json","events_json":"https://pith.science/api/pith-number/MXBMD6Y22XYGBJPHRFNFY7KVTH/events.json","paper":"https://pith.science/paper/MXBMD6Y2"},"agent_actions":{"view_html":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH","download_json":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH.json","view_paper":"https://pith.science/paper/MXBMD6Y2","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2601.15621&json=true","fetch_graph":"https://pith.science/api/pith-number/MXBMD6Y22XYGBJPHRFNFY7KVTH/graph.json","fetch_events":"https://pith.science/api/pith-number/MXBMD6Y22XYGBJPHRFNFY7KVTH/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH/action/timestamp_anchor","attest_storage":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH/action/storage_attestation","attest_author":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH/action/author_attestation","sign_citation":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH/action/citation_signature","submit_replication":"https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH/action/replication_record"}},"created_at":"2026-05-17T23:38:46.851252+00:00","updated_at":"2026-05-17T23:38:46.851252+00:00"}