{"work":{"id":"31f99dad-40ae-4a19-aeff-eafa54f5b42a","openalex_id":null,"doi":null,"arxiv_id":"2503.01710","raw_key":null,"title":"Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens","authors":null,"authors_text":"Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li","year":2025,"venue":"cs.SD","abstract":"Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.","external_url":"https://arxiv.org/abs/2503.01710","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-19T04:32:03.667489+00:00","pith_arxiv_id":"2503.01710","created_at":"2026-05-10T04:04:47.228902+00:00","updated_at":"2026-05-19T04:32:03.667489+00:00","title_quality_ok":true,"display_title":"Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens","render_title":"Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens"},"hub":{"state":{"work_id":"31f99dad-40ae-4a19-aeff-eafa54f5b42a","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":18,"external_cited_by_count":null,"distinct_field_count":4,"first_pith_cited_at":"2025-05-23T07:55:21+00:00","last_pith_cited_at":"2026-05-11T18:04:33+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-19T09:31:30.287771+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":4},{"context_role":"baseline","n":1},{"context_role":"dataset","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":4},{"context_polarity":"baseline","n":1},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}