{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:UZSHTDQ6ER4ZN7L5FZRDYGKWN5","short_pith_number":"pith:UZSHTDQ6","schema_version":"1.0","canonical_sha256":"a664798e1e247996fd7d2e623c19566f7c7d5765e46c5909b3f8f491fc7c573a","source":{"kind":"arxiv","id":"2406.02430","version":1},"attestation_state":"computed","paper":{"title":"Seed-TTS: A Family of High-Quality Versatile Speech Generation Models","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Seed-TTS generates speech that matches human recordings in speaker similarity and naturalness according to objective metrics and listener tests.","cross_cats":["cs.SD"],"primary_cat":"eess.AS","authors_text":"Chao Yao, Chuang Ding, Chumin Li, Dejian Zhong, Dongya Jia, Feiya Li, Hui Li, Jian Cong, Jian Wu, Jiawei Chen, Jiaxin Li, Jitong Chen, Junjie Pan, Junteng Zhang, Lelai Deng, Lin Liu, Lu Gao, Lu Lu, Mingqing Gong, Peisong Huang, Philip Anastassiou, Qidi Zhang, Qingqing Huang, Shouda Liu, Shuo Zhang, Sichao Liu, Wenjie Zhang, Xiaobin Zhuang, Xiaoyang Li, Xingxing Li, Xin Wang, Xudong Liu, Yang Zhang, Yifeng Yang, Yuanhao Yi, Yuanyuan Huo, Yuanzhe Chen, Yuchen Liu, Yuping Wang, Yuxuan Wang, Zhengxi Liu, Zhen Wei, Zhiying Huang, Zhuo Chen, Zilin Zhao, Ziyi Chen","submitted_at":"2024-06-04T15:48:29Z","abstract_excerpt":"We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. 
Seed-TTS offers superior controllability over various speech attributes such as emotion and is cap"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2406.02430","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"eess.AS","submitted_at":"2024-06-04T15:48:29Z","cross_cats_sorted":["cs.SD"],"title_canon_sha256":"89d4e6461411ffd6c11540a5e97e810bc4fe655a32c0d9a800fb9876c6018686","abstract_canon_sha256":"d89b536815c0e92765ca6b32060a72535fa57099d15026dcd6562ed465c7c460"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.556545Z","signature_b64":"M3Tvka9DJpYjps9SUFwHSY8Rd36sJolUlWeoZ7JL+IEP24as/Bps0SpX6G6eKlP4kOP7eTx7GKOAs2bG6p2UBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"a664798e1e247996fd7d2e623c19566f7c7d5765e46c5909b3f8f491fc7c573a","last_reissued_at":"2026-05-17T23:38:52.556038Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.556038Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Seed-TTS: A Family of High-Quality Versatile Speech Generation Models","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Seed-TTS generates speech that matches human recordings in speaker similarity and naturalness according to objective metrics and listener tests.","cross_cats":["cs.SD"],"primary_cat":"eess.AS","authors_text":"Chao Yao, Chuang Ding, Chumin Li, Dejian Zhong, Dongya Jia, Feiya Li, Hui Li, Jian Cong, Jian Wu, Jiawei Chen, 
Jiaxin Li, Jitong Chen, Junjie Pan, Junteng Zhang, Lelai Deng, Lin Liu, Lu Gao, Lu Lu, Mingqing Gong, Peisong Huang, Philip Anastassiou, Qidi Zhang, Qingqing Huang, Shouda Liu, Shuo Zhang, Sichao Liu, Wenjie Zhang, Xiaobin Zhuang, Xiaoyang Li, Xingxing Li, Xin Wang, Xudong Liu, Yang Zhang, Yifeng Yang, Yuanhao Yi, Yuanyuan Huo, Yuanzhe Chen, Yuchen Liu, Yuping Wang, Yuxuan Wang, Zhengxi Liu, Zhen Wei, Zhiying Huang, Zhuo Chen, Zilin Zhao, Ziyi Chen","submitted_at":"2024-06-04T15:48:29Z","abstract_excerpt":"We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. 
Seed-TTS offers superior controllability over various speech attributes such as emotion and is cap"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Seed-TTS achieves performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That subjective listener evaluations and the chosen objective metrics reliably indicate real-world indistinguishability and that the models generalize to unseen speakers and conditions without overfitting to the training distribution.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Seed-TTS generates speech that matches human recordings in speaker similarity and naturalness according to objective metrics and listener tests.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7f51cb9aa8a48baaa8e9bf8ab552afb553244791374cfd8f94abb5944a32c8c5"},"source":{"id":"2406.02430","kind":"arxiv","version":1},"verdict":{"id":"2c0eabde-a9d8-46c2-96d4-28928f8ae3ce","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:22:23.552594Z","strongest_claim":"Seed-TTS achieves performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations.","one_line_summary":"Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation 
and reinforcement learning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That subjective listener evaluations and the chosen objective metrics reliably indicate real-world indistinguishability and that the models generalize to unseen speakers and conditions without overfitting to the training distribution.","pith_extraction_headline":"Seed-TTS generates speech that matches human recordings in speaker similarity and naturalness according to objective metrics and listener tests."},"references":{"count":45,"sample":[{"doi":"","year":2023,"title":"Streaming voice conversion via intermediate bottleneck features and non-streaming teacher guidance","work_id":"083f5388-e34c-4a4d-b4e1-7770c7aade6c","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"StreamVoice: Streamable context-aware language modeling for real-time zero-shot voice conversion","work_id":"9bb06a48-785f-4d9f-b05c-8f8580345daf","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"BASE TTS: lessons from building a billion-parameter text-to-speech model on 100k hours of data","work_id":"3cd5a31a-12e5-47a1-bd4a-1d7a9ed783ef","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias","work_id":"304a4ad2-1fb2-4a0d-a21a-8947af9e12c7","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Deep reinforcement learning: An 
overview","work_id":"68a5d043-bfc1-49c5-9396-40bb3d602cee","ref_index":5,"cited_arxiv_id":"1701.07274","is_internal_anchor":true}],"resolved_work":45,"snapshot_sha256":"ba766a6f2b2299bab6a8f6c785e95bad2d658d69dd56abd4d50e90a25f9d3652","internal_anchors":13},"formal_canon":{"evidence_count":3,"snapshot_sha256":"4995e507e11d6c2afe44fed32e8925a215f162992b3b8b14120dca5bb852ccac"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2406.02430","created_at":"2026-05-17T23:38:52.556118+00:00"},{"alias_kind":"arxiv_version","alias_value":"2406.02430v1","created_at":"2026-05-17T23:38:52.556118+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2406.02430","created_at":"2026-05-17T23:38:52.556118+00:00"},{"alias_kind":"pith_short_12","alias_value":"UZSHTDQ6ER4Z","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"UZSHTDQ6ER4ZN7L5","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"UZSHTDQ6","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":36,"internal_anchor_count":36,"sample":[{"citing_arxiv_id":"2409.18512","citing_title":"Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2505.14066","citing_title":"SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22083","citing_title":"RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2512.01537","citing_title":"Two-Dimensional Quantization for Geometry-Aware 
Audio Coding","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20830","citing_title":"Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17085","citing_title":"Taming Audio VAEs via Target-KL Regularization","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16964","citing_title":"SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15984","citing_title":"Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2507.09318","citing_title":"ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2502.11946","citing_title":"Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22220","citing_title":"StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2510.01284","citing_title":"Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2601.15621","citing_title":"Qwen3-TTS Technical Report","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2410.06885","citing_title":"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2507.16632","citing_title":"Step-Audio 2 Technical Report","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2505.17589","citing_title":"CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and 
Post-training","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08558","citing_title":"WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2507.08128","citing_title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2603.25551","citing_title":"Voxtral TTS","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.00688","citing_title":"OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2412.10117","citing_title":"CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27607","citing_title":"JaiTTS: A Thai Voice Cloning Model","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09386","citing_title":"Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27607","citing_title":"JaiTTS: A Thai Voice Cloning Model","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19679","citing_title":"MMControl: Unified Multi-Modal Control for Joint Audio-Video 
Generation","ref_index":1,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5","json":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5.json","graph_json":"https://pith.science/api/pith-number/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/graph.json","events_json":"https://pith.science/api/pith-number/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/events.json","paper":"https://pith.science/paper/UZSHTDQ6"},"agent_actions":{"view_html":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5","download_json":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5.json","view_paper":"https://pith.science/paper/UZSHTDQ6","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2406.02430&json=true","fetch_graph":"https://pith.science/api/pith-number/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/graph.json","fetch_events":"https://pith.science/api/pith-number/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/action/timestamp_anchor","attest_storage":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/action/storage_attestation","attest_author":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/action/author_attestation","sign_citation":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/action/citation_signature","submit_replication":"https://pith.science/pith/UZSHTDQ6ER4ZN7L5FZRDYGKWN5/action/replication_record"}},"created_at":"2026-05-17T23:38:52.556118+00:00","updated_at":"2026-05-17T23:38:52.556118+00:00"}