{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:XW2OGWZBDLENM2K5RCFFSFTMPR","short_pith_number":"pith:XW2OGWZB","schema_version":"1.0","canonical_sha256":"bdb4e35b211ac8d6695d888a59166c7c7b860f8bbe48a77ac3a4d56cb19a7f16","source":{"kind":"arxiv","id":"2410.06885","version":3},"attestation_state":"computed","paper":{"title":"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching","license":"http://creativecommons.org/licenses/by/4.0/","headline":"F5-TTS generates natural zero-shot speech by padding text with filler tokens and refining it with ConvNeXt inside a flow-matching DiT model.","cross_cats":["cs.SD"],"primary_cat":"eess.AS","authors_text":"Chunhui Wang, Jian Zhao, Kai Yu, Keqi Deng, Xie Chen, Yushen Chen, Zhikang Niu, Ziyang Ma","submitted_at":"2024-10-09T13:46:34Z","abstract_excerpt":"This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2410.06885","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"eess.AS","submitted_at":"2024-10-09T13:46:34Z","cross_cats_sorted":["cs.SD"],"title_canon_sha256":"15be299e681d350e3ac1f0251de0a246b339839e069bf3787ea03c96f457655e","abstract_canon_sha256":"b3a7045cc5062db49e67d6f377ddfcde6cf72ad49da3746e8f4cf3af1b7b1d89"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.894604Z","signature_b64":"wPikmzL0BluWtp+DZYa+Nw0Z3i+CFrZ73SvtzHBGmK17owCZR/onDTV5M+X+ZgXdDcbOO/v6qTs6mlUCyiVWDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"bdb4e35b211ac8d6695d888a59166c7c7b860f8bbe48a77ac3a4d56cb19a7f16","last_reissued_at":"2026-05-17T23:38:48.894074Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.894074Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching","license":"http://creativecommons.org/licenses/by/4.0/","headline":"F5-TTS generates natural zero-shot speech by padding text with filler tokens and refining it with ConvNeXt inside a flow-matching DiT model.","cross_cats":["cs.SD"],"primary_cat":"eess.AS","authors_text":"Chunhui Wang, Jian Zhao, Kai Yu, Keqi Deng, Xie Chen, Yushen Chen, Zhikang Niu, Ziyang Ma","submitted_at":"2024-10-09T13:46:34Z","abstract_excerpt":"This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That simply padding text with filler tokens and refining with ConvNeXt is sufficient to achieve robust alignment and fast convergence without duration models or phoneme alignment, building on the feasibility shown by E2 TTS.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"F5-TTS generates natural zero-shot speech by padding text with filler tokens and refining it with ConvNeXt inside a flow-matching DiT model.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"65d7acd3c0efeb3d87330cd23d4bd550d8bd3f76a88256cd9ca0688c9279bc14"},"source":{"id":"2410.06885","kind":"arxiv","version":3},"verdict":{"id":"61f63891-6c96-450c-980a-545cd7d26cf7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T06:00:45.349547Z","strongest_claim":"Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency.","one_line_summary":"F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That simply padding text with filler tokens and refining with ConvNeXt is sufficient to achieve robust alignment and fast convergence without duration models or phoneme alignment, building on the feasibility shown by E2 TTS.","pith_extraction_headline":"F5-TTS generates natural zero-shot speech by padding text with filler tokens and refining it with ConvNeXt inside a flow-matching DiT model."},"references":{"count":128,"sample":[{"doi":"","year":null,"title":"Keith Ito and Linda Johnson , title =","work_id":"693090f8-48cd-433f-957e-c5c483264d26","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"International Conference on Machine Learning , pages=","work_id":"e38b79cf-3429-4fda-9658-7d665a341fe8","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems , volume=","work_id":"53d35174-0a7b-4232-8532-8dc58228c19c","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Liu, Zhijun and Wang, Shuai and Zhu, Pengcheng and Bi, Mengxiao and Li, Haizhou , journal=","work_id":"c743056b-566a-4d23-8d2e-98a9673b9039","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Meister, Aleksandr and Novikov, Matvei and Karpov, Nikolay and Bakhturina, Evelina and Lavrukhin, Vitaly and Ginsburg, Boris , booktitle=. 2023 , organization=","work_id":"b36e2699-594f-4c72-9e58-849f2423a9ab","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":128,"snapshot_sha256":"f76615984a0ad3ca81c44e5801c231a76696190cff3011f9af837cb294042447","internal_anchors":11},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2410.06885","created_at":"2026-05-17T23:38:48.894155+00:00"},{"alias_kind":"arxiv_version","alias_value":"2410.06885v3","created_at":"2026-05-17T23:38:48.894155+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2410.06885","created_at":"2026-05-17T23:38:48.894155+00:00"},{"alias_kind":"pith_short_12","alias_value":"XW2OGWZBDLEN","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"XW2OGWZBDLENM2K5","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"XW2OGWZB","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2605.23859","citing_title":"Natural Yet Challenging to Detect: Robust In-the-Wild TTS through EMA and Dual-Scoring Prompt Selection -- Submission for WildSpoof 2026 TTS Track","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15984","citing_title":"Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2507.09318","citing_title":"ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2509.04072","citing_title":"Computational Narrative Understanding for Expressive Text-to-Speech","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2509.20086","citing_title":"OLaPh: Optimal Language Phonemizer","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2510.19414","citing_title":"EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2601.15621","citing_title":"Qwen3-TTS Technical Report","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2602.05449","citing_title":"DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2505.17589","citing_title":"CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14555","citing_title":"Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2604.00688","citing_title":"OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2412.10117","citing_title":"CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12310","citing_title":"Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09386","citing_title":"Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25441","citing_title":"Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02374","citing_title":"Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02496","citing_title":"Tibetan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22209","citing_title":"UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21164","citing_title":"MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21481","citing_title":"Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19055","citing_title":"ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12292","citing_title":"CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11103","citing_title":"ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08363","citing_title":"CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13229","citing_title":"ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks","ref_index":40,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR","json":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR.json","graph_json":"https://pith.science/api/pith-number/XW2OGWZBDLENM2K5RCFFSFTMPR/graph.json","events_json":"https://pith.science/api/pith-number/XW2OGWZBDLENM2K5RCFFSFTMPR/events.json","paper":"https://pith.science/paper/XW2OGWZB"},"agent_actions":{"view_html":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR","download_json":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR.json","view_paper":"https://pith.science/paper/XW2OGWZB","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2410.06885&json=true","fetch_graph":"https://pith.science/api/pith-number/XW2OGWZBDLENM2K5RCFFSFTMPR/graph.json","fetch_events":"https://pith.science/api/pith-number/XW2OGWZBDLENM2K5RCFFSFTMPR/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR/action/timestamp_anchor","attest_storage":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR/action/storage_attestation","attest_author":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR/action/author_attestation","sign_citation":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR/action/citation_signature","submit_replication":"https://pith.science/pith/XW2OGWZBDLENM2K5RCFFSFTMPR/action/replication_record"}},"created_at":"2026-05-17T23:38:48.894155+00:00","updated_at":"2026-05-17T23:38:48.894155+00:00"}