Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al · 2017 · cs.CL · arXiv 1703.10135

14 Pith papers cite this work. Polarity classification is still indexing.

abstract

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

representative citing papers

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

eess.AS · 2019-06-26 · unverdicted · novelty 7.0

RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.

Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

cs.LG · 2025-12-11 · unverdicted · novelty 6.0

Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

cs.CV · 2025-06-30 · unverdicted · novelty 6.0

JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.

Script2Screen: Supporting Dialogue Scriptwriting with Interactive Audiovisual Generation

cs.HC · 2025-04-21 · unverdicted · novelty 6.0

Script2Screen integrates scriptwriting with an interactive text-to-audiovisual pipeline for dialogues, using a user study to show it supports iterative refinement in creative writing.

DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis

eess.AS · 2019-07-19 · unverdicted · novelty 6.0

Two new embedding algorithms (similarity vector prediction and Frobenius-norm matrix matching) trained on subjective inter-speaker scores yield d-vectors more correlated with human similarity judgments and improve TTS quality for unseen speakers.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

eess.AS · 2026-04-21 · unverdicted · novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

cs.SD · 2026-04-07 · unverdicted · novelty 6.0

A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.

Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

cs.SD · 2019-06-21 · unverdicted · novelty 5.0

Deep autoregressive models with F0 discretization, post-processing, and self-attention prenet outperform RNNs in objective and subjective metrics for singing voice synthesis on a Chinese corpus.

Character-Centered Dialogue Generation from Scene-Level Prompts

cs.CV · 2025-05-22 · unverdicted · novelty 4.0

A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

Hierarchical Sequence to Sequence Voice Conversion with Limited Data

eess.AS · 2019-07-15 · unverdicted · novelty 4.0

A hierarchical seq2seq model for parallel voice conversion is pretrained as an autoencoder on single-speaker data and then adapted to limited multi-speaker data; the mel spectrograms it produces are converted to audio with a WaveNet vocoder.

Improving Performance of End-to-End ASR on Numeric Sequences

eess.AS · 2019-07-01 · unverdicted · novelty 4.0

TTS-generated numeric training data plus a compact neural denormalizer improve E2E ASR word error rates on numeric sequences by up to a factor of 8 for the longest cases.

Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

eess.AS · 2026-04-21 · unverdicted · novelty 3.0

Voice range indicates TTS model capability: VITS scores highest, Glow-TTS is best at soft phonation, and CPPs values of 7-8 dB mark natural quality, while values over 10 dB sound robotic.

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

cs.CL · 2026-05-16 · unverdicted · novelty 2.0

A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.

citing papers explorer

Showing 14 of 14 citing papers.

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis · eess.AS · 2019-06-26 · unverdicted · none · ref 24 · internal anchor
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs · cs.LG · 2025-12-11 · unverdicted · none · ref 18 · internal anchor
Step-Audio 2 Technical Report · cs.CL · 2025-07-22 · unverdicted · none · ref 69 · internal anchor
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching · cs.CV · 2025-06-30 · unverdicted · none · ref 27 · internal anchor
Script2Screen: Supporting Dialogue Scriptwriting with Interactive Audiovisual Generation · cs.HC · 2025-04-21 · unverdicted · none · ref 65 · internal anchor
DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis · eess.AS · 2019-07-19 · unverdicted · none · ref 13 · internal anchor
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation · eess.AS · 2026-04-21 · unverdicted · none · ref 10
A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech · cs.SD · 2026-04-07 · unverdicted · none · ref 6
Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling · cs.SD · 2019-06-21 · unverdicted · none · ref 11 · internal anchor
Character-Centered Dialogue Generation from Scene-Level Prompts · cs.CV · 2025-05-22 · unverdicted · none · ref 57 · internal anchor
Hierarchical Sequence to Sequence Voice Conversion with Limited Data · eess.AS · 2019-07-15 · unverdicted · none · ref 10 · internal anchor
Improving Performance of End-to-End ASR on Numeric Sequences · eess.AS · 2019-07-01 · unverdicted · none · ref 27 · internal anchor
Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment · eess.AS · 2026-04-21 · unverdicted · none · ref 48
Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages · cs.CL · 2026-05-16 · unverdicted · none · ref 238 · internal anchor
