{"paper":{"title":"CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Supervised semantic tokens from a multilingual ASR model enable more consistent and similar zero-shot voice cloning than unsupervised tokens in CosyVoice.","cross_cats":["cs.AI","eess.AS"],"primary_cat":"cs.SD","authors_text":"Hangrui Hu, Heng Lu, Kai Hu, Qian Chen, Shiliang Zhang, Siqi Zheng, Yexin Yang, Yue Gu, Zhifu Gao, Zhihao Du, Zhijie Yan, Ziyang Ma","submitted_at":"2024-07-07T15:16:19Z","abstract_excerpt":"Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"Inserting vector quantization into the multilingual ASR encoder produces tokens that retain sufficient semantic, acoustic, and prosodic information for high-quality reconstruction by the conditional flow matching model without major loss.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Supervised semantic tokens from ASR enable CosyVoice to outperform unsupervised tokens in zero-shot multilingual TTS via LLM text-to-token and flow-matching token-to-speech synthesis.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Supervised semantic tokens from a multilingual ASR model enable more consistent and similar zero-shot voice cloning than unsupervised tokens in CosyVoice.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"be57b4bfd5676f70ba50f6d5b9bc37cc2c707b76ac0679f8b8e33c33a123f5f3"},"source":{"id":"2407.05407","kind":"arxiv","version":2},"verdict":{"id":"07249c92-9077-441a-a1a9-2e2784331e63","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T16:56:19.725726Z","strongest_claim":"supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning","one_line_summary":"Supervised semantic tokens from ASR enable CosyVoice to outperform unsupervised tokens in zero-shot multilingual TTS via LLM text-to-token and flow-matching token-to-speech synthesis.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"Inserting vector quantization into the multilingual ASR encoder produces tokens that retain sufficient semantic, acoustic, and prosodic information for high-quality reconstruction by the conditional flow matching model without major loss.","pith_extraction_headline":"Supervised semantic tokens from a multilingual ASR model enable more consistent and similar zero-shot voice cloning than unsupervised tokens in CosyVoice."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"fb6e657ac52cd7a443d2b7d0739a6b895e6e1a46f6b205257b59eb5dcad06585"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}