{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:HCXY62YPHCZQYYPT3MZO5JI6SM","short_pith_number":"pith:HCXY62YP","schema_version":"1.0","canonical_sha256":"38af8f6b0f38b30c61f3db32eea51e9318757ac25b5f5667b854c69832896505","source":{"kind":"arxiv","id":"2407.05407","version":2},"attestation_state":"computed","paper":{"title":"CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Supervised semantic tokens from a multilingual ASR model enable more consistent and similar zero-shot voice cloning than unsupervised tokens in CosyVoice.","cross_cats":["cs.AI","eess.AS"],"primary_cat":"cs.SD","authors_text":"Hangrui Hu, Heng Lu, Kai Hu, Qian Chen, Shiliang Zhang, Siqi Zheng, Yexin Yang, Yue Gu, Zhifu Gao, Zhihao Du, Zhijie Yan, Ziyang Ma","submitted_at":"2024-07-07T15:16:19Z","abstract_excerpt":"Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2407.05407","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.SD","submitted_at":"2024-07-07T15:16:19Z","cross_cats_sorted":["cs.AI","eess.AS"],"title_canon_sha256":"c94cc54b5b0e4e4aacad550fe4d4f44213a7a0bed116877c57b077199304a3b1","abstract_canon_sha256":"c81bb0c9503acba49fc092e8e606bd94db76a2fc79ae9ff5f31ad8ba58b9d6ec"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.817690Z","signature_b64":"kHNmVjvinEXOGUi9nklGKH+M09Yrk1UwGcVBDXBw1U1BuX1+2tVaOxh8BBFdkz3fqWalsC4jooFNEm0ArdXYCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"38af8f6b0f38b30c61f3db32eea51e9318757ac25b5f5667b854c69832896505","last_reissued_at":"2026-05-17T23:38:50.817252Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.817252Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Supervised semantic tokens from a multilingual ASR model enable more consistent and similar zero-shot voice cloning than unsupervised tokens in CosyVoice.","cross_cats":["cs.AI","eess.AS"],"primary_cat":"cs.SD","authors_text":"Hangrui Hu, Heng Lu, Kai Hu, Qian Chen, Shiliang Zhang, Siqi Zheng, Yexin Yang, Yue Gu, Zhifu Gao, Zhihao Du, Zhijie Yan, Ziyang Ma","submitted_at":"2024-07-07T15:16:19Z","abstract_excerpt":"Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"Inserting vector quantization into the multilingual ASR encoder produces tokens that retain sufficient semantic, acoustic, and prosodic information for high-quality reconstruction by the conditional flow matching model without major loss.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Supervised semantic tokens from ASR enable CosyVoice to outperform unsupervised tokens in zero-shot multilingual TTS via LLM text-to-token and flow-matching token-to-speech synthesis.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Supervised semantic tokens from a multilingual ASR model enable more consistent and similar zero-shot voice cloning than unsupervised tokens in CosyVoice.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"be57b4bfd5676f70ba50f6d5b9bc37cc2c707b76ac0679f8b8e33c33a123f5f3"},"source":{"id":"2407.05407","kind":"arxiv","version":2},"verdict":{"id":"07249c92-9077-441a-a1a9-2e2784331e63","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T16:56:19.725726Z","strongest_claim":"supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning","one_line_summary":"Supervised semantic tokens from ASR enable CosyVoice to outperform unsupervised tokens in zero-shot multilingual TTS via LLM text-to-token and flow-matching token-to-speech synthesis.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"Inserting vector quantization into the multilingual ASR encoder produces tokens that retain sufficient semantic, acoustic, and prosodic information for high-quality reconstruction by the conditional flow matching model without major loss.","pith_extraction_headline":"Supervised semantic tokens from a multilingual ASR model enable more consistent and similar zero-shot voice cloning than unsupervised tokens in CosyVoice."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"fb6e657ac52cd7a443d2b7d0739a6b895e6e1a46f6b205257b59eb5dcad06585"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2407.05407","created_at":"2026-05-17T23:38:50.817325+00:00"},{"alias_kind":"arxiv_version","alias_value":"2407.05407v2","created_at":"2026-05-17T23:38:50.817325+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2407.05407","created_at":"2026-05-17T23:38:50.817325+00:00"},{"alias_kind":"pith_short_12","alias_value":"HCXY62YPHCZQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"HCXY62YPHCZQYYPT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"HCXY62YP","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":42,"internal_anchor_count":42,"sample":[{"citing_arxiv_id":"2409.18512","citing_title":"Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2504.08528","citing_title":"On The Landscape of Spoken Language Models: A Comprehensive Survey","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2506.23552","citing_title":"JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2509.03526","citing_title":"Enhancing Speech Large Language Models through Reinforced Behavior Alignment","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2601.03170","citing_title":"TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09568","citing_title":"RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17583","citing_title":"AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech","ref_index":88,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16026","citing_title":"From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16964","citing_title":"SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2509.19883","citing_title":"CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2502.11946","citing_title":"Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22220","citing_title":"StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2509.24708","citing_title":"SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2510.01284","citing_title":"Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2410.17196","citing_title":"VoiceBench: Benchmarking LLM-Based Voice Assistants","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2512.14234","citing_title":"ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2601.15621","citing_title":"Qwen3-TTS Technical Report","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2601.22143","citing_title":"JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2410.06885","citing_title":"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2507.16632","citing_title":"Step-Audio 2 Technical Report","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2505.17589","citing_title":"CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2412.02612","citing_title":"GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2603.02364","citing_title":"When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2603.05373","citing_title":"Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2603.19857","citing_title":"FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts","ref_index":6,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM","json":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM.json","graph_json":"https://pith.science/api/pith-number/HCXY62YPHCZQYYPT3MZO5JI6SM/graph.json","events_json":"https://pith.science/api/pith-number/HCXY62YPHCZQYYPT3MZO5JI6SM/events.json","paper":"https://pith.science/paper/HCXY62YP"},"agent_actions":{"view_html":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM","download_json":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM.json","view_paper":"https://pith.science/paper/HCXY62YP","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2407.05407&json=true","fetch_graph":"https://pith.science/api/pith-number/HCXY62YPHCZQYYPT3MZO5JI6SM/graph.json","fetch_events":"https://pith.science/api/pith-number/HCXY62YPHCZQYYPT3MZO5JI6SM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM/action/storage_attestation","attest_author":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM/action/author_attestation","sign_citation":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM/action/citation_signature","submit_replication":"https://pith.science/pith/HCXY62YPHCZQYYPT3MZO5JI6SM/action/replication_record"}},"created_at":"2026-05-17T23:38:50.817325+00:00","updated_at":"2026-05-17T23:38:50.817325+00:00"}