pith. machine review for the scientific record.

arxiv: 2410.06885 · v3 · submitted 2024-10-09 · 📡 eess.AS · cs.SD

Recognition: no theorem link

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:00 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords text-to-speech · flow matching · Diffusion Transformer · non-autoregressive TTS · zero-shot synthesis · ConvNeXt · Sway Sampling

The pith

F5-TTS generates natural zero-shot speech by padding text with filler tokens and refining it with ConvNeXt inside a flow-matching DiT model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a non-autoregressive TTS system that performs speech generation through flow matching on a Diffusion Transformer. It avoids duration models, text encoders, and phoneme alignment by padding the input text with filler tokens to the same length as the target speech sequence. A ConvNeXt module refines the padded text representation to ease alignment, while an inference-only Sway Sampling strategy improves both quality and speed. The resulting model trains faster than prior diffusion TTS systems, reaches an inference RTF of 0.15, and shows strong naturalness, expressiveness, and multilingual zero-shot performance after training on a 100K-hour public dataset.
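
To make the length-matching step concrete, here is a minimal sketch of the filler-token padding in Python; the token ids, the reserved filler symbol, and the frame count are illustrative assumptions, not the released F5-TTS code.

    FILLER_ID = 0  # assumed id reserved for the filler token

    def pad_text_to_speech_length(char_ids, num_frames):
        """Pad a character-id sequence with filler tokens until it has one
        position per target speech frame, so no duration model or phoneme
        alignment is needed to pair text with speech."""
        if len(char_ids) > num_frames:
            raise ValueError("text longer than target speech; cannot pad")
        return char_ids + [FILLER_ID] * (num_frames - len(char_ids))

    # Example: a 12-character input stretched to a 300-frame utterance.
    padded = pad_text_to_speech_length(list(range(1, 13)), 300)
    assert len(padded) == 300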

Core claim

By padding text inputs with filler tokens to match speech length, refining the text representation with ConvNeXt, and applying flow matching in a Diffusion Transformer, the system achieves robust alignment and fast convergence without duration models or phoneme alignment; an added Sway Sampling procedure at inference time further improves performance and efficiency.
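
As a reading aid, the flow-matching objective behind this claim can be sketched as follows; the straight-line (optimal-transport) path and the model(x_t, t, text) signature are assumptions standing in for the paper's DiT, not its published code.

    import torch

    def cfm_training_step(model, mel_target, text_cond):
        """One conditional flow-matching step on a straight noise-to-data
        path: sample a flow step t, interpolate linearly, and regress the
        constant velocity x1 - x0."""
        x1 = mel_target                                 # data (mel frames)
        x0 = torch.randn_like(x1)                       # Gaussian source
        t = torch.rand(x1.shape[0], device=x1.device)   # flow step in [0, 1]
        t_ = t.view(-1, 1, 1)
        xt = (1.0 - t_) * x0 + t_ * x1                  # point on the path
        v_target = x1 - x0                              # target velocity
        v_pred = model(xt, t, text_cond)
        return torch.mean((v_pred - v_target) ** 2)

    # Smoke test with a dummy "model" that ignores its conditioning.
    dummy = lambda xt, t, cond: torch.zeros_like(xt)
    loss = cfm_training_step(dummy, torch.randn(2, 80, 100), text_cond=None)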

What carries the argument

Padding of text with filler tokens to match speech length, followed by ConvNeXt refinement and Sway Sampling, inside a Diffusion Transformer flow-matching network.

If this is right

  • Training completes faster than existing diffusion-based TTS models.
  • Inference runs at an RTF of 0.15, well below that of typical diffusion TTS systems (see the sketch after this list for how RTF is computed).
  • Zero-shot synthesis produces highly natural and expressive speech across languages.
  • Code-switching between languages occurs seamlessly without extra training.
  • Speaking rate can be controlled directly through the sampling process.
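
For reference, RTF (real-time factor) is wall-clock synthesis time divided by the duration of the audio produced, so an RTF of 0.15 means one second of speech is synthesized in 0.15 seconds. A minimal measurement sketch, where synthesize is a hypothetical callable returning a waveform and sample rate, not the paper's actual API:

    import time

    def real_time_factor(synthesize, text):
        """RTF = synthesis wall-clock time / duration of generated audio."""
        start = time.perf_counter()
        waveform, sample_rate = synthesize(text)
        elapsed = time.perf_counter() - start
        return elapsed / (len(waveform) / sample_rate)

    # Dummy synthesizer returning one second of silence at 24 kHz.
    fake = lambda text: ([0.0] * 24000, 24000)
    print(real_time_factor(fake, "hello"))  # near zero for the dummy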

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same padding and refinement pattern could be tested on other flow-matching audio tasks such as speech enhancement or music generation.
  • Because Sway Sampling requires no retraining, it offers an immediate efficiency upgrade for any existing flow-matching TTS system.
  • Scaling the training data beyond 100K hours would likely strengthen the zero-shot and code-switching results further.

Load-bearing premise

That simply padding text with filler tokens and refining the result with ConvNeXt supplies enough alignment information to let flow matching generate fluent speech without any duration model or phoneme alignment.
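
For intuition about what that refinement step looks like, here is a 1-D ConvNeXt-style block of the general kind the paper describes; the kernel size, widths, and layer layout are guesses for illustration, not the paper's configuration.

    import torch
    from torch import nn

    class ConvNeXtBlock1d(nn.Module):
        """Depthwise conv -> LayerNorm -> pointwise MLP -> residual, applied
        along the padded text sequence (one position per speech frame)."""
        def __init__(self, dim, hidden=None, kernel_size=7):
            super().__init__()
            hidden = hidden or 4 * dim
            self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                    padding=kernel_size // 2, groups=dim)
            self.norm = nn.LayerNorm(dim)
            self.pwconv1 = nn.Linear(dim, hidden)
            self.act = nn.GELU()
            self.pwconv2 = nn.Linear(hidden, dim)

        def forward(self, x):
            # x: (batch, time, dim)
            residual = x
            x = self.dwconv(x.transpose(1, 2)).transpose(1, 2)
            x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
            return residual + x

    block = ConvNeXtBlock1d(dim=64)
    out = block(torch.randn(2, 300, 64))  # 300 padded text positions
    assert out.shape == (2, 300, 64)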

What would settle it

Training the same architecture on the 100K-hour dataset with the ConvNeXt refinement step removed, and measuring whether alignment errors or slow convergence appear, would directly test whether the padding-plus-ConvNeXt design carries the claimed robustness.

original abstract

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. We have released all codes and checkpoints to promote community development, at https://SWivid.github.io/F5-TTS/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces F5-TTS, a fully non-autoregressive TTS system based on flow matching with a Diffusion Transformer (DiT). It simplifies the design by padding text inputs with filler tokens to match speech length (building on E2 TTS), refines the text representation with ConvNeXt for better alignment, and proposes an inference-time Sway Sampling strategy. The work claims faster training, an RTF of 0.15, highly natural zero-shot multilingual synthesis, seamless code-switching, and efficient speed control on a public 100K-hour dataset, with code and checkpoints released.

Significance. If the empirical claims hold, the work would be significant for efficient TTS by demonstrating that simplified conditioning (filler padding + ConvNeXt) can replace duration models and explicit alignment while achieving competitive RTF and zero-shot performance on large-scale multilingual data. The release of code supports reproducibility and the Sway Sampling strategy offers a general inference improvement for flow-matching models.

major comments (3)
  1. [§3] Model architecture: The central claim that ConvNeXt refinement of padded filler tokens suffices for robust implicit alignment (eliminating duration models and phoneme alignment) lacks supporting ablations or alignment diagnostics (e.g., learned duration distributions or cross-attention visualizations) comparing directly to E2 TTS; without these, it is unclear whether the reported faster convergence and multilingual robustness follow from the ConvNeXt block or other unstated changes.
  2. [Results] Quantitative evaluation: The abstract and claims assert an RTF of 0.15, faster training, and superiority over diffusion-based TTS models, yet no specific baseline numbers, training-time comparisons, or quality metrics (MOS, WER, speaker similarity) are referenced in the provided abstract; the magnitude of improvement and statistical significance cannot be assessed without these tables or figures.
  3. [Inference] Sway Sampling: The Sway Sampling strategy is presented as improving performance and efficiency without retraining, but the paper must supply the precise algorithm or equation governing the step-size modulation and an ablation isolating its contribution to code-switching naturalness and zero-shot stability on the 100K-hour multilingual set.
minor comments (2)
  1. [Abstract] The title uses an informal acronym (F5-TTS) that should be expanded on first use in the abstract and introduction for clarity.
  2. [Introduction] The E2 TTS reference is mentioned but lacks a full citation; add the complete bibliographic entry.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are needed to strengthen the claims, we will incorporate the suggested analyses and clarifications in the revised manuscript.

point-by-point responses
  1. Referee: [§3] Model architecture: The central claim that ConvNeXt refinement of padded filler tokens suffices for robust implicit alignment (eliminating duration models and phoneme alignment) lacks supporting ablations or alignment diagnostics (e.g., learned duration distributions or cross-attention visualizations) comparing directly to E2 TTS; without these, it is unclear whether the reported faster convergence and multilingual robustness follow from the ConvNeXt block or other unstated changes.

    Authors: We appreciate the referee's observation. The manuscript already includes direct comparisons to E2 TTS demonstrating faster convergence and improved multilingual robustness on the 100K-hour dataset. To more explicitly isolate the contribution of the ConvNeXt refinement block to implicit alignment, we will add an ablation removing the ConvNeXt module, along with cross-attention visualizations and comparisons of learned duration distributions versus E2 TTS in the revised §3 and experimental sections. revision: yes

  2. Referee: [Results] Quantitative evaluation: The abstract and claims assert an RTF of 0.15, faster training, and superiority over diffusion-based TTS models, yet no specific baseline numbers, training-time comparisons, or quality metrics (MOS, WER, speaker similarity) are referenced in the provided abstract; the magnitude of improvement and statistical significance cannot be assessed without these tables or figures.

    Authors: The full manuscript contains comprehensive tables and figures reporting RTF (0.15), training-time comparisons, MOS, WER, and speaker similarity against diffusion-based baselines, with statistical details. We agree the abstract would benefit from referencing key quantitative results. In the revision we will update the abstract to include the RTF value and a concise statement of the main metric improvements while respecting length constraints; the detailed tables and figures will remain in the results section. revision: partial

  3. Referee: [Inference] Sway Sampling: The Sway Sampling strategy is presented as improving performance and efficiency without retraining, but the paper must supply the precise algorithm or equation governing the step-size modulation and an ablation isolating its contribution to code-switching naturalness and zero-shot stability on the 100K-hour multilingual set.

    Authors: We agree that the precise formulation and targeted ablation are valuable. The manuscript describes Sway Sampling at a high level; we will add the exact equation governing the step-size modulation in the inference section. We will also include a dedicated ablation study quantifying its isolated effect on code-switching naturalness and zero-shot stability using the 100K-hour multilingual data. revision: yes
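
For readers who want the shape of the idea now, the widely quoted form of the sway warp is u' = u + s·(cos(πu/2) − 1 + u), with s < 0 pulling flow steps toward the noisier early region; treat the sketch below as an assumed reconstruction rather than the paper's verified equation.

    import math

    def sway_sample(u, s=-1.0):
        """Warp a uniform flow step u in [0, 1]; endpoint-preserving
        (f(0) = 0, f(1) = 1) and monotone for s in [-1, 0]."""
        return u + s * (math.cos(math.pi / 2.0 * u) - 1.0 + u)

    # Warp 8 uniform steps; with s = -1 they crowd toward u = 0.
    warped = [sway_sample(i / 7) for i in range(8)]
    assert abs(warped[0]) < 1e-9 and abs(warped[-1] - 1.0) < 1e-9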

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with external citation

full rationale

The paper describes an empirical neural TTS architecture based on flow matching and DiT, where text is padded with filler tokens and refined via ConvNeXt before denoising. It cites E2 TTS only for the basic padding feasibility (an external prior result) and introduces Sway Sampling as a new inference heuristic. No equations, derivations, or fitted parameters are presented that reduce any claimed prediction or result to the inputs by construction. Training on 100K hours of data and released code provide external verifiability, so the central claims remain independent of any self-referential loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of text padding plus ConvNeXt for alignment and Sway Sampling for inference; these are domain assumptions drawn from prior TTS literature rather than new axioms.

free parameters (1)
  • ConvNeXt and DiT hyperparameters
    Standard neural network training choices whose values are not detailed in the abstract.
axioms (1)
  • domain assumption: Padding text with filler tokens to speech length enables flow matching to generate aligned speech without explicit duration or alignment modules
    Invoked when describing the core generation process, referencing E2 TTS feasibility.

pith-pipeline@v0.9.0 · 5565 in / 1270 out tokens · 49508 ms · 2026-05-16T06:00:45.349547+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    Poly-SVC converts singing voices from polyphonic recordings while keeping melody, lyrics, and harmonies by combining CQT-based pitch extraction with a conditional flow matching diffusion decoder.

  2. Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 7.0

    GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

  3. Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

    cs.SD 2026-05 unverdicted novelty 7.0

    Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

  4. Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

    cs.CL 2026-04 unverdicted novelty 7.0

    A controlled pairwise evaluation framework for multilingual TTS in 10 Indic languages produces a preference leaderboard using Bradley-Terry modeling and SHAP analysis on 120K+ comparisons.

  5. MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

    cs.SD 2026-04 unverdicted novelty 7.0

    MAGIC-TTS is the first TTS system with explicit token-level duration and pause control that improves timing accuracy while preserving natural quality when controls are absent.

  6. CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

    cs.SD 2026-04 unverdicted novelty 7.0

    CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...

  7. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

  8. DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

    cs.CV 2026-02 unverdicted novelty 7.0

    DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.

  9. Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

    cs.SD 2026-05 unverdicted novelty 6.0

    Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.

  10. Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

    cs.SD 2026-04 unverdicted novelty 6.0

    A combination of phoneme romanization, targeted LoRA adaptation, and voice-prompt recovery enables commercial-class Indic TTS from a non-Indic base without acoustic retraining or commercial data.

  11. UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

    eess.AS 2026-04 unverdicted novelty 6.0

    UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.

  12. ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

    eess.AS 2026-04 unverdicted novelty 6.0

    ProSDD learns speaker-conditioned prosodic variation from real speech via supervised masked prediction and jointly optimizes it with spoof detection, cutting EER substantially on ASVspoof 2024 and emotional datasets.

  13. OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

  14. Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative

    cs.SD 2026-03 accept novelty 6.0

    RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.

  15. CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    cs.SD 2025-05 unverdicted novelty 6.0

    CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

  16. ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

    cs.SD 2026-04 unverdicted novelty 5.0

    ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.

  17. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    cs.SD 2024-12 unverdicted novelty 5.0

    CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...

  18. ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

    cs.SD 2026-04 unverdicted novelty 4.0

    ATRIE disentangles timbre and prosody in a Persona-Prosody Dual-Track model distilled from a large LLM to achieve strong identity preservation (EER 0.04) and emotional speech synthesis with SOTA results on an extended...

  19. Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

    eess.AS 2026-04 unverdicted novelty 3.0

    Voice range indicates TTS model capability with VITS highest, Glow-TTS best at soft phonation, and CPPs of 7-8 dB marking natural quality while values over 10 dB sound robotic.

Reference graph

Works this paper leans on

128 extracted references · 128 canonical work pages · cited by 19 Pith papers · 11 internal anchors
