F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Pith reviewed 2026-05-16 06:00 UTC · model grok-4.3
The pith
F5-TTS generates natural zero-shot speech by padding text with filler tokens and refining it with ConvNeXt inside a flow-matching DiT model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By padding text inputs with filler tokens to match speech length, refining the text representation with ConvNeXt, and applying flow matching in a Diffusion Transformer, the system achieves robust alignment and fast convergence without duration models or phoneme alignment; an added Sway Sampling procedure at inference time further improves performance and efficiency.
What carries the argument
Padding of text with filler tokens to match speech length, followed by ConvNeXt refinement and Sway Sampling, inside a Diffusion Transformer flow-matching network.
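The padding mechanism can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the paper's code: the character ids, the reserved filler id, and the mel frame count below are all hypothetical.

```python
FILLER = 0  # hypothetical reserved id for the filler token

def pad_text_to_speech_length(char_ids, mel_len, filler=FILLER):
    """Extend a character-id sequence with filler tokens until it has
    one entry per mel frame, so text and speech sequences share a
    length and alignment can be learned implicitly, with no duration
    model or phoneme alignment step."""
    if len(char_ids) > mel_len:
        raise ValueError("text longer than speech; cannot pad")
    return char_ids + [filler] * (mel_len - len(char_ids))

padded = pad_text_to_speech_length([7, 3, 9], mel_len=8)
# padded: the 3 character ids followed by 5 filler tokens
```

Per the abstract, this padded sequence is what ConvNeXt then refines before it conditions the flow-matching DiT.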
If this is right
- Training converges faster than in existing diffusion-based TTS models.
- Inference runs at a real-time factor (RTF) of 0.15, much faster than typical diffusion-based TTS systems.
- Zero-shot synthesis produces highly natural and expressive speech across languages.
- Code-switching between languages occurs seamlessly without extra training.
- Speaking rate can be controlled directly through the sampling process.
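The RTF claim is easy to unpack: real-time factor is synthesis wall-clock time divided by the duration of the audio produced, so RTF below 1 means faster than real time. A minimal computation with illustrative numbers (not measurements from the paper):

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of audio produced.
    An RTF of 0.15 means a 10-second clip takes 1.5 seconds to generate."""
    return synthesis_seconds / audio_seconds

rtf = real_time_factor(synthesis_seconds=1.5, audio_seconds=10.0)
# rtf == 0.15
```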
Where Pith is reading between the lines
- The same padding and refinement pattern could be tested on other flow-matching audio tasks such as speech enhancement or music generation.
- Because Sway Sampling requires no retraining, it offers an immediate efficiency upgrade for any existing flow-matching TTS system.
- Scaling the training data beyond 100K hours would likely strengthen the zero-shot and code-switching results further.
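Sway Sampling reshapes the flow-step schedule at inference time. The abstract gives no formula, so the cosine warp and coefficient below are assumptions for illustration only; the point is that a monotone warp of the uniform step schedule spends more denoising steps where they help most.

```python
import math

def sway_step(u, s=-1.0):
    """Warp a uniform flow step u in [0, 1] to t in [0, 1].
    With s < 0 steps concentrate near t = 0 (the noisier early
    states); s = 0 recovers the uniform schedule. The cosine form
    and coefficient are assumptions, not taken from the abstract."""
    return u + s * (math.cos(math.pi * u / 2.0) - 1.0 + u)

steps = [sway_step(i / 8) for i in range(9)]
# monotone from 0.0 to 1.0, denser near 0 when s = -1
```

Because a warp like this only remaps solver steps, it can be applied to any pretrained flow-matching sampler without retraining, which is what makes it a drop-in upgrade.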
Load-bearing premise
That simply padding text with filler tokens and refining the result with ConvNeXt supplies enough alignment information to let flow matching generate fluent speech without any duration model or phoneme alignment.
What would settle it
Training the same architecture on the 100K-hour dataset while measuring whether alignment errors or slow convergence appear when the ConvNeXt refinement step is removed would directly test whether the padding-plus-ConvNeXt design carries the claimed robustness.
Original abstract
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. We have released all codes and checkpoints to promote community development, at https://SWivid.github.io/F5-TTS/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces F5-TTS, a fully non-autoregressive TTS system based on flow matching with a Diffusion Transformer (DiT). It simplifies the design by padding text inputs with filler tokens to match speech length (building on E2 TTS), refines the text representation with ConvNeXt for better alignment, and proposes an inference-time Sway Sampling strategy. The work claims faster training, an RTF of 0.15, highly natural zero-shot multilingual synthesis, seamless code-switching, and efficient speed control on a public 100K-hour dataset, with code and checkpoints released.
Significance. If the empirical claims hold, the work would be significant for efficient TTS: it would demonstrate that simplified conditioning (filler padding + ConvNeXt) can replace duration models and explicit alignment while achieving competitive RTF and zero-shot performance on large-scale multilingual data. The code release supports reproducibility, and the Sway Sampling strategy offers a general inference-time improvement for flow-matching models.
major comments (3)
- [§3] §3 (model architecture): The central claim that ConvNeXt refinement of padded filler tokens suffices for robust implicit alignment (eliminating duration models and phoneme alignment) lacks supporting ablations or alignment diagnostics (e.g., learned duration distributions or cross-attention visualizations) comparing directly to E2 TTS; without this, it is unclear whether the reported faster convergence and multilingual robustness follow from the ConvNeXt block or other unstated changes.
- [Results] Results section (quantitative evaluation): The abstract and claims assert an RTF of 0.15, faster training, and superiority over diffusion-based TTS models, yet no specific baseline numbers, training-time comparisons, or quality metrics (MOS, WER, speaker similarity) are referenced in the provided abstract; the magnitude of improvement and statistical significance cannot be assessed without these tables or figures.
- [Inference] Inference section (Sway Sampling): The Sway Sampling strategy is presented as improving performance and efficiency without retraining, but the paper must supply the precise algorithm or equation governing the step-size modulation and an ablation isolating its contribution to code-switching naturalness and zero-shot stability on the 100K-hour multilingual set.
minor comments (2)
- [Abstract] The title uses an informal acronym (F5-TTS) that should be expanded on first use in the abstract and introduction for clarity.
- [Introduction] The E2 TTS reference is mentioned but lacks a full citation; add the complete bibliographic entry.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are needed to strengthen the claims, we will incorporate the suggested analyses and clarifications in the revised manuscript.
Point-by-point responses
-
Referee: [§3] §3 (model architecture): The central claim that ConvNeXt refinement of padded filler tokens suffices for robust implicit alignment (eliminating duration models and phoneme alignment) lacks supporting ablations or alignment diagnostics (e.g., learned duration distributions or cross-attention visualizations) comparing directly to E2 TTS; without this, it is unclear whether the reported faster convergence and multilingual robustness follow from the ConvNeXt block or other unstated changes.
Authors: We appreciate the referee's observation. The manuscript already includes direct comparisons to E2 TTS demonstrating faster convergence and improved multilingual robustness on the 100K-hour dataset. To more explicitly isolate the contribution of the ConvNeXt refinement block to implicit alignment, we will add an ablation removing the ConvNeXt module, along with cross-attention visualizations and comparisons of learned duration distributions versus E2 TTS in the revised §3 and experimental sections. revision: yes
-
Referee: [Results] Results section (quantitative evaluation): The abstract and claims assert an RTF of 0.15, faster training, and superiority over diffusion-based TTS models, yet no specific baseline numbers, training-time comparisons, or quality metrics (MOS, WER, speaker similarity) are referenced in the provided abstract; the magnitude of improvement and statistical significance cannot be assessed without these tables or figures.
Authors: The full manuscript contains comprehensive tables and figures reporting RTF (0.15), training-time comparisons, MOS, WER, and speaker similarity against diffusion-based baselines, with statistical details. We agree the abstract would benefit from referencing key quantitative results. In the revision we will update the abstract to include the RTF value and a concise statement of the main metric improvements while respecting length constraints; the detailed tables and figures will remain in the results section. revision: partial
-
Referee: [Inference] Inference section (Sway Sampling): The Sway Sampling strategy is presented as improving performance and efficiency without retraining, but the paper must supply the precise algorithm or equation governing the step-size modulation and an ablation isolating its contribution to code-switching naturalness and zero-shot stability on the 100K-hour multilingual set.
Authors: We agree that the precise formulation and targeted ablation are valuable. The manuscript describes Sway Sampling at a high level; we will add the exact equation governing the step-size modulation in the inference section. We will also include a dedicated ablation study quantifying its isolated effect on code-switching naturalness and zero-shot stability using the 100K-hour multilingual data. revision: yes
Circularity Check
No significant circularity; empirical architecture with external citation
full rationale
The paper describes an empirical neural TTS architecture based on flow matching and DiT, where text is padded with filler tokens and refined via ConvNeXt before denoising. It cites E2 TTS only for the basic padding feasibility (an external prior result) and introduces Sway Sampling as a new inference heuristic. No equations, derivations, or fitted parameters are presented that reduce any claimed prediction or result to the inputs by construction. Training on 100K hours of data and released code provide external verifiability, so the central claims remain independent of any self-referential loop.
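The flow-matching training signal the rationale refers to can be written down generically. This is the standard conditional flow-matching objective with a linear path, not the paper's exact training code; the zero-predicting "model" below is a placeholder.

```python
import random

random.seed(0)

def cfm_target(x0, x1, t):
    """Linear probability path x_t = (1 - t) * x0 + t * x1; the
    regression target for the velocity network is x1 - x0."""
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v

x0 = [random.gauss(0, 1) for _ in range(4)]  # noise sample
x1 = [random.gauss(0, 1) for _ in range(4)]  # data, e.g. one mel frame
xt, v = cfm_target(x0, x1, 0.3)
# v_theta(xt, t) is trained with mean squared error against v; here a
# placeholder model that always predicts zero gives the loss:
mse = sum(vi ** 2 for vi in v) / len(v)
```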
Axiom & Free-Parameter Ledger
free parameters (1)
- ConvNeXt and DiT hyperparameters
axioms (1)
- domain assumption: Padding text with filler tokens to speech length enables flow matching to generate aligned speech without explicit duration or alignment modules.
Forward citations
Cited by 19 Pith papers
-
Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
Poly-SVC converts singing voices from polyphonic recordings while keeping melody, lyrics, and harmonies by combining CQT-based pitch extraction with a conditional flow matching diffusion decoder.
-
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
-
Tibetan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.
-
Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages
A controlled pairwise evaluation framework for multilingual TTS in 10 Indic languages produces a preference leaderboard using Bradley-Terry modeling and SHAP analysis on 120K+ comparisons.
-
MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control
MAGIC-TTS is the first TTS system with explicit token-level duration and pause control that improves timing accuracy while preserving natural quality when controls are absent.
-
CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.
-
Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis
Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.
-
Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost
A combination of phoneme romanization, targeted LoRA adaptation, and voice-prompt recovery enables commercial-class Indic TTS from a non-Indic base without acoustic retraining or commercial data.
-
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.
-
ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks
ProSDD learns speaker-conditioned prosodic variation from real speech via supervised masked prediction and jointly optimizes it with spoof detection, cutting EER substantially on ASVspoof 2024 and emotional datasets.
-
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...
-
Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.
-
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...
-
ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
-
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...
-
ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis
ATRIE disentangles timbre and prosody in a Persona-Prosody Dual-Track model distilled from a large LLM to achieve strong identity preservation (EER 0.04) and emotional speech synthesis with SOTA results on an extended...
-
Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
Voice range indicates TTS model capability with VITS highest, Glow-TTS best at soft phonation, and CPPs of 7-8 dB marking natural quality while values over 10 dB sound robotic.