hub

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications

· 2024 · arXiv 2409.03283

21 Pith papers cite this work. Polarity classification is still indexing.

21 Pith papers citing it

read on arXiv browse 21 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 baseline 1

citation-polarity summary

background 2 baseline 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

eess.AS · 2026-06-08 · unverdicted · novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

eess.AS · 2025-10-22 · unverdicted · novelty 7.0

EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

cs.SD · 2026-06-08 · unverdicted · novelty 6.0

TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.

UniVocal: Unified Speech-Singing Code-Switching Synthesis

cs.SD · 2026-06-01 · unverdicted · novelty 6.0

UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

eess.AS · 2026-05-29 · unverdicted · novelty 6.0

SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.

SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

eess.AS · 2026-05-16 · unverdicted · novelty 6.0

SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

cs.MM · 2025-09-30 · unverdicted · novelty 6.0

A single generative model uses twin DiT backbones with blockwise cross-attention and scaled-RoPE timing exchange to synthesize synchronized audio-video directly.

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

eess.AS · 2025-07-12 · conditional · novelty 6.0

ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

cs.CV · 2025-06-30 · unverdicted · novelty 6.0

JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

cs.SD · 2025-05-23 · unverdicted · novelty 6.0

CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.

End-to-End Training for Discrete Token LLM based TTS System

cs.SD · 2026-06-08 · unverdicted · novelty 5.0

An end-to-end optimization framework jointly trains the speech tokenizer, LLM, FM model, and reward model for discrete-token TTS, reporting new SOTA WER of 0.78% and 1.56% on Seed-TTS-Eval with 0.6B LLM and 0.5B FM.

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

eess.AS · 2026-06-08 · unverdicted · novelty 5.0

FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.

VoxCPM2 Technical Report

cs.SD · 2026-06-05 · unverdicted · novelty 5.0

VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

cs.SD · 2024-12-13 · unverdicted · novelty 5.0

CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis

cs.SD · 2026-07-01 · unverdicted · novelty 4.0

Unified guidance framework for Flow Matching speech synthesis achieves nearly 3x faster inference and improved speaker similarity by combining heterogeneous data augmentation with intrinsic model guidance to eliminate CFG overhead.

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

cs.CV · 2026-04-13

citing papers explorer

Showing 21 of 21 citing papers.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling eess.AS · 2026-06-02 · unverdicted · none · ref 22
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model cs.SD · 2026-06-30 · unverdicted · none · ref 191
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis eess.AS · 2026-06-08 · unverdicted · none · ref 29
HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling cs.SD · 2026-05-11 · unverdicted · none · ref 51
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection eess.AS · 2025-10-22 · unverdicted · none · ref 28
EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.
TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech cs.SD · 2026-06-08 · unverdicted · none · ref 14
TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.
UniVocal: Unified Speech-Singing Code-Switching Synthesis cs.SD · 2026-06-01 · unverdicted · none · ref 45
UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue eess.AS · 2026-05-29 · unverdicted · none · ref 12
SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis eess.AS · 2026-05-16 · unverdicted · none · ref 43
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models cs.CL · 2026-04-01 · unverdicted · none · ref 4
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation cs.MM · 2025-09-30 · unverdicted · none · ref 6
A single generative model uses twin DiT backbones with blockwise cross-attention and scaled-RoPE timing exchange to synthesize synchronized audio-video directly.
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching eess.AS · 2025-07-12 · conditional · none · ref 12
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching cs.CV · 2025-06-30 · unverdicted · none · ref 47
JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training cs.SD · 2025-05-23 · unverdicted · none · ref 29
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.
End-to-End Training for Discrete Token LLM based TTS System cs.SD · 2026-06-08 · unverdicted · none · ref 6
An end-to-end optimization framework jointly trains the speech tokenizer, LLM, FM model, and reward model for discrete-token TTS, reporting new SOTA WER of 0.78% and 1.56% on Seed-TTS-Eval with 0.6B LLM and 0.5B FM.
FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation eess.AS · 2026-06-08 · unverdicted · none · ref 12
FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.
VoxCPM2 Technical Report cs.SD · 2026-06-05 · unverdicted · none · ref 13
VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models cs.SD · 2024-12-13 · unverdicted · none · ref 35
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 97
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis cs.SD · 2026-07-01 · unverdicted · none · ref 15
Unified guidance framework for Flow Matching speech synthesis achieves nearly 3x faster inference and improved speaker similarity by combining heterogeneous data augmentation with intrinsic model guidance to eliminate CFG overhead.
Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey cs.CV · 2026-04-13 · unreviewed · ref 87

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer