pith. machine review for the scientific record.

arxiv: 2410.06885 · v3 · submitted 2024-10-09 · 📡 eess.AS · cs.SD

Recognition: no theorem link

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:00 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords text-to-speech · flow matching · Diffusion Transformer · non-autoregressive TTS · zero-shot synthesis · ConvNeXt · Sway Sampling

The pith

F5-TTS generates natural zero-shot speech by padding text with filler tokens and refining it with ConvNeXt inside a flow-matching DiT model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a non-autoregressive TTS system that performs speech generation through flow matching on a Diffusion Transformer. It avoids duration models, text encoders, and phoneme alignment by padding the input text with filler tokens to the same length as the target speech sequence. A ConvNeXt module refines the padded text representation to ease alignment, while an inference-only Sway Sampling strategy improves both quality and speed. The resulting model trains faster than prior diffusion TTS systems, reaches an inference RTF of 0.15, and shows strong naturalness, expressiveness, and multilingual zero-shot performance after training on a 100K-hour public dataset.
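
To make the length-matching step concrete, here is a minimal sketch of the filler-token padding in Python; the token ids, the reserved filler symbol, and the frame count are illustrative assumptions, not the released F5-TTS code.

    FILLER_ID = 0  # assumed id reserved for the filler token

    def pad_text_to_speech_length(char_ids, num_frames):
        """Pad a character-id sequence with filler tokens until it has one
        position per target speech frame, so no duration model or phoneme
        alignment is needed to pair text with speech."""
        if len(char_ids) > num_frames:
            raise ValueError("text longer than target speech; cannot pad")
        return char_ids + [FILLER_ID] * (num_frames - len(char_ids))

    # Example: a 12-character input stretched to a 300-frame utterance.
    padded = pad_text_to_speech_length(list(range(1, 13)), 300)
    assert len(padded) == 300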

Core claim

By padding text inputs with filler tokens to match speech length, refining the text representation with ConvNeXt, and applying flow matching in a Diffusion Transformer, the system achieves robust alignment and fast convergence without duration models or phoneme alignment; an added Sway Sampling procedure at inference time further improves performance and efficiency.
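
As a reading aid, the flow-matching objective behind this claim can be sketched as follows; the straight-line (optimal-transport) path and the model(x_t, t, text) signature are assumptions standing in for the paper's DiT, not its published code.

    import torch

    def cfm_training_step(model, mel_target, text_cond):
        """One conditional flow-matching step on a straight noise-to-data
        path: sample a flow step t, interpolate linearly, and regress the
        constant velocity x1 - x0."""
        x1 = mel_target                                 # data (mel frames)
        x0 = torch.randn_like(x1)                       # Gaussian source
        t = torch.rand(x1.shape[0], device=x1.device)   # flow step in [0, 1]
        t_ = t.view(-1, 1, 1)
        xt = (1.0 - t_) * x0 + t_ * x1                  # point on the path
        v_target = x1 - x0                              # target velocity
        v_pred = model(xt, t, text_cond)
        return torch.mean((v_pred - v_target) ** 2)

    # Smoke test with a dummy "model" that ignores its conditioning.
    dummy = lambda xt, t, cond: torch.zeros_like(xt)
    loss = cfm_training_step(dummy, torch.randn(2, 80, 100), text_cond=None)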

What carries the argument

Padding of text with filler tokens to match speech length, followed by ConvNeXt refinement and Sway Sampling, inside a Diffusion Transformer flow-matching network.

If this is right

  • Training completes faster than existing diffusion-based TTS models.
  • Inference runs at an RTF of 0.15, well below that of typical diffusion TTS systems (see the sketch after this list for how RTF is computed).
  • Zero-shot synthesis produces highly natural and expressive speech across languages.
  • Code-switching between languages occurs seamlessly without extra training.
  • Speaking rate can be controlled directly through the sampling process.
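
For reference, RTF (real-time factor) is wall-clock synthesis time divided by the duration of the audio produced, so an RTF of 0.15 means one second of speech is synthesized in 0.15 seconds. A minimal measurement sketch, where synthesize is a hypothetical callable returning a waveform and sample rate, not the paper's actual API:

    import time

    def real_time_factor(synthesize, text):
        """RTF = synthesis wall-clock time / duration of generated audio."""
        start = time.perf_counter()
        waveform, sample_rate = synthesize(text)
        elapsed = time.perf_counter() - start
        return elapsed / (len(waveform) / sample_rate)

    # Dummy synthesizer returning one second of silence at 24 kHz.
    fake = lambda text: ([0.0] * 24000, 24000)
    print(real_time_factor(fake, "hello"))  # near zero for the dummy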

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same padding and refinement pattern could be tested on other flow-matching audio tasks such as speech enhancement or music generation.
  • Because Sway Sampling requires no retraining, it offers an immediate efficiency upgrade for any existing flow-matching TTS system.
  • Scaling the training data beyond 100K hours would likely strengthen the zero-shot and code-switching results further.

Load-bearing premise

That simply padding text with filler tokens and refining the result with ConvNeXt supplies enough alignment information to let flow matching generate fluent speech without any duration model or phoneme alignment.
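
For intuition about what that refinement step looks like, here is a 1-D ConvNeXt-style block of the general kind the paper describes; the kernel size, widths, and layer layout are guesses for illustration, not the paper's configuration.

    import torch
    from torch import nn

    class ConvNeXtBlock1d(nn.Module):
        """Depthwise conv -> LayerNorm -> pointwise MLP -> residual, applied
        along the padded text sequence (one position per speech frame)."""
        def __init__(self, dim, hidden=None, kernel_size=7):
            super().__init__()
            hidden = hidden or 4 * dim
            self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                    padding=kernel_size // 2, groups=dim)
            self.norm = nn.LayerNorm(dim)
            self.pwconv1 = nn.Linear(dim, hidden)
            self.act = nn.GELU()
            self.pwconv2 = nn.Linear(hidden, dim)

        def forward(self, x):
            # x: (batch, time, dim)
            residual = x
            x = self.dwconv(x.transpose(1, 2)).transpose(1, 2)
            x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
            return residual + x

    block = ConvNeXtBlock1d(dim=64)
    out = block(torch.randn(2, 300, 64))  # 300 padded text positions
    assert out.shape == (2, 300, 64)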

What would settle it

Training the same architecture on the 100K-hour dataset with the ConvNeXt refinement step removed, and measuring whether alignment errors or slow convergence appear, would directly test whether the padding-plus-ConvNeXt design carries the claimed robustness.

original abstract

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. We have released all codes and checkpoints to promote community development, at https://SWivid.github.io/F5-TTS/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces F5-TTS, a fully non-autoregressive TTS system based on flow matching with a Diffusion Transformer (DiT). It simplifies the design by padding text inputs with filler tokens to match speech length (building on E2 TTS), refines the text representation with ConvNeXt for better alignment, and proposes an inference-time Sway Sampling strategy. The work claims faster training, an RTF of 0.15, highly natural zero-shot multilingual synthesis, seamless code-switching, and efficient speed control on a public 100K-hour dataset, with code and checkpoints released.

Significance. If the empirical claims hold, the work would be significant for efficient TTS by demonstrating that simplified conditioning (filler padding + ConvNeXt) can replace duration models and explicit alignment while achieving competitive RTF and zero-shot performance on large-scale multilingual data. The release of code supports reproducibility and the Sway Sampling strategy offers a general inference improvement for flow-matching models.

major comments (3)
  1. [§3] Model architecture: The central claim that ConvNeXt refinement of padded filler tokens suffices for robust implicit alignment (eliminating duration models and phoneme alignment) lacks supporting ablations or alignment diagnostics (e.g., learned duration distributions or cross-attention visualizations) comparing directly to E2 TTS; without these, it is unclear whether the reported faster convergence and multilingual robustness follow from the ConvNeXt block or other unstated changes.
  2. [Results] Quantitative evaluation: The abstract and claims assert an RTF of 0.15, faster training, and superiority over diffusion-based TTS models, yet no specific baseline numbers, training-time comparisons, or quality metrics (MOS, WER, speaker similarity) are referenced in the provided abstract; the magnitude of improvement and statistical significance cannot be assessed without these tables or figures.
  3. [Inference] Sway Sampling: The Sway Sampling strategy is presented as improving performance and efficiency without retraining, but the paper must supply the precise algorithm or equation governing the step-size modulation and an ablation isolating its contribution to code-switching naturalness and zero-shot stability on the 100K-hour multilingual set.
minor comments (2)
  1. [Abstract] The title uses an informal acronym (F5-TTS) that should be expanded on first use in the abstract and introduction for clarity.
  2. [Introduction] The E2 TTS reference is mentioned but lacks a full citation; add the complete bibliographic entry.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are needed to strengthen the claims, we will incorporate the suggested analyses and clarifications in the revised manuscript.

point-by-point responses
  1. Referee: [§3] Model architecture: The central claim that ConvNeXt refinement of padded filler tokens suffices for robust implicit alignment (eliminating duration models and phoneme alignment) lacks supporting ablations or alignment diagnostics (e.g., learned duration distributions or cross-attention visualizations) comparing directly to E2 TTS; without these, it is unclear whether the reported faster convergence and multilingual robustness follow from the ConvNeXt block or other unstated changes.

    Authors: We appreciate the referee's observation. The manuscript already includes direct comparisons to E2 TTS demonstrating faster convergence and improved multilingual robustness on the 100K-hour dataset. To more explicitly isolate the contribution of the ConvNeXt refinement block to implicit alignment, we will add an ablation removing the ConvNeXt module, along with cross-attention visualizations and comparisons of learned duration distributions versus E2 TTS in the revised §3 and experimental sections. revision: yes

  2. Referee: [Results] Quantitative evaluation: The abstract and claims assert an RTF of 0.15, faster training, and superiority over diffusion-based TTS models, yet no specific baseline numbers, training-time comparisons, or quality metrics (MOS, WER, speaker similarity) are referenced in the provided abstract; the magnitude of improvement and statistical significance cannot be assessed without these tables or figures.

    Authors: The full manuscript contains comprehensive tables and figures reporting RTF (0.15), training-time comparisons, MOS, WER, and speaker similarity against diffusion-based baselines, with statistical details. We agree the abstract would benefit from referencing key quantitative results. In the revision we will update the abstract to include the RTF value and a concise statement of the main metric improvements while respecting length constraints; the detailed tables and figures will remain in the results section. revision: partial

  3. Referee: [Inference] Sway Sampling: The Sway Sampling strategy is presented as improving performance and efficiency without retraining, but the paper must supply the precise algorithm or equation governing the step-size modulation and an ablation isolating its contribution to code-switching naturalness and zero-shot stability on the 100K-hour multilingual set.

    Authors: We agree that the precise formulation and targeted ablation are valuable. The manuscript describes Sway Sampling at a high level; we will add the exact equation governing the step-size modulation in the inference section. We will also include a dedicated ablation study quantifying its isolated effect on code-switching naturalness and zero-shot stability using the 100K-hour multilingual data. revision: yes
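
For readers who want the shape of the idea now, the widely quoted form of the sway warp is u' = u + s·(cos(πu/2) − 1 + u), with s < 0 pulling flow steps toward the noisier early region; treat the sketch below as an assumed reconstruction rather than the paper's verified equation.

    import math

    def sway_sample(u, s=-1.0):
        """Warp a uniform flow step u in [0, 1]; endpoint-preserving
        (f(0) = 0, f(1) = 1) and monotone for s in [-1, 0]."""
        return u + s * (math.cos(math.pi / 2.0 * u) - 1.0 + u)

    # Warp 8 uniform steps; with s = -1 they crowd toward u = 0.
    warped = [sway_sample(i / 7) for i in range(8)]
    assert abs(warped[0]) < 1e-9 and abs(warped[-1] - 1.0) < 1e-9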

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with external citation

full rationale

The paper describes an empirical neural TTS architecture based on flow matching and DiT, where text is padded with filler tokens and refined via ConvNeXt before denoising. It cites E2 TTS only for the basic padding feasibility (an external prior result) and introduces Sway Sampling as a new inference heuristic. No equations, derivations, or fitted parameters are presented that reduce any claimed prediction or result to the inputs by construction. Training on 100K hours of data and released code provide external verifiability, so the central claims remain independent of any self-referential loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of text padding plus ConvNeXt for alignment and Sway Sampling for inference; these are domain assumptions drawn from prior TTS literature rather than new axioms.

free parameters (1)
  • ConvNeXt and DiT hyperparameters
    Standard neural network training choices whose values are not detailed in the abstract.
axioms (1)
  • domain assumption: Padding text with filler tokens to speech length enables flow matching to generate aligned speech without explicit duration or alignment modules
    Invoked when describing the core generation process, referencing E2 TTS feasibility.

pith-pipeline@v0.9.0 · 5565 in / 1270 out tokens · 49508 ms · 2026-05-16T06:00:45.349547+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    Poly-SVC converts singing voices from polyphonic recordings while keeping melody, lyrics, and harmonies by combining CQT-based pitch extraction with a conditional flow matching diffusion decoder.

  2. Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 7.0

    GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

  3. Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

    cs.SD 2026-05 unverdicted novelty 7.0

    Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

  4. Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

    cs.CL 2026-04 unverdicted novelty 7.0

    A controlled pairwise evaluation framework for multilingual TTS in 10 Indic languages produces a preference leaderboard using Bradley-Terry modeling and SHAP analysis on 120K+ comparisons.

  5. MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

    cs.SD 2026-04 unverdicted novelty 7.0

    MAGIC-TTS is the first TTS system with explicit token-level duration and pause control that improves timing accuracy while preserving natural quality when controls are absent.

  6. CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

    cs.SD 2026-04 unverdicted novelty 7.0

    CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...

  7. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

  8. DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

    cs.CV 2026-02 unverdicted novelty 7.0

    DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.

  9. Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

    cs.SD 2026-05 unverdicted novelty 6.0

    Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.

  10. Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

    cs.SD 2026-04 unverdicted novelty 6.0

    A combination of phoneme romanization, targeted LoRA adaptation, and voice-prompt recovery enables commercial-class Indic TTS from a non-Indic base without acoustic retraining or commercial data.

  11. UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

    eess.AS 2026-04 unverdicted novelty 6.0

    UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.

  12. ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

    eess.AS 2026-04 unverdicted novelty 6.0

    ProSDD learns speaker-conditioned prosodic variation from real speech via supervised masked prediction and jointly optimizes it with spoof detection, cutting EER substantially on ASVspoof 2024 and emotional datasets.

  13. OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

  14. Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative

    cs.SD 2026-03 accept novelty 6.0

    RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.

  15. CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    cs.SD 2025-05 unverdicted novelty 6.0

    CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

  16. ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

    cs.SD 2026-04 unverdicted novelty 5.0

    ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.

  17. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    cs.SD 2024-12 unverdicted novelty 5.0

    CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...

  18. ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

    cs.SD 2026-04 unverdicted novelty 4.0

    ATRIE disentangles timbre and prosody in a Persona-Prosody Dual-Track model distilled from a large LLM to achieve strong identity preservation (EER 0.04) and emotional speech synthesis with SOTA results on an extended...

  19. Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

    eess.AS 2026-04 unverdicted novelty 3.0

    Voice range indicates TTS model capability with VITS highest, Glow-TTS best at soft phonation, and CPPs of 7-8 dB marking natural quality while values over 10 dB sound robotic.

Reference graph

Works this paper leans on

128 extracted references · 128 canonical work pages · cited by 19 Pith papers · 11 internal anchors
