{"paper":{"title":"Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single-stream speech codec decouples content from speaker traits to let an LLM deliver both zero-shot cloning and fine voice control.","cross_cats":["cs.AI","eess.AS"],"primary_cat":"cs.SD","authors_text":"Jiahao Pan, Lei Xie, Linqin Li, Liumeng Xue, Mingqi Jiang, Pengcheng Zhu, Qixi Zheng, Ruibin Yuan, Rui Wang, Sitong Cheng, Songxiang Liu, Wei Xue, Weizhen Bian, Xiaoqin Feng, Xie Chen, Xinfa Zhu, Xinsheng Wang, Yike Guo, Yunlin Chen, Zheng Liang, Zhen Ye, Zhifei Li, Zhixian Zhao, Ziyang Ma, Ziyu Zhang","submitted_at":"2025-03-03T16:23:10Z","abstract_excerpt":"Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disent"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That BiCodec's decomposition into semantic and global tokens provides clean, independent control over linguistic content and speaker attributes without quality loss or unwanted interactions between the two token streams.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Spark-TTS uses BiCodec single-stream decoupled tokens and Qwen2.5 LLM with CoT to deliver efficient state-of-the-art zero-shot voice cloning and fine-grained voice control.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single-stream speech codec decouples content from speaker traits to let an LLM deliver both zero-shot cloning and fine voice control.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"821fcf90e3a796fcae42cb751a83e58a25291f240cc683e8832a811e01dc0f32"},"source":{"id":"2503.01710","kind":"arxiv","version":1},"verdict":{"id":"70670b8e-920a-4c16-96f7-244a6aca6369","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T09:25:47.089995Z","strongest_claim":"Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis.","one_line_summary":"Spark-TTS uses BiCodec single-stream decoupled tokens and Qwen2.5 LLM with CoT to deliver efficient state-of-the-art zero-shot voice cloning and fine-grained voice control.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That BiCodec's decomposition into semantic and global tokens provides clean, independent control over linguistic content and speaker attributes without quality loss or unwanted interactions between the two token streams.","pith_extraction_headline":"A single-stream speech codec decouples content from speaker traits to let an LLM deliver both zero-shot cloning and fine voice control."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"8ab134f5f443d832b2440e17ffae9e7948a6e1aa7b1b9c4e6fba6382c4015da5"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}