Tacotron: Towards End-to-End Speech Synthesis

Daisy Stanton; Navdeep Jaitly; Quoc Le; Rif A. Saurous; RJ Skerry-Ryan; Rob Clark; Ron J. Weiss; Samy Bengio; Yannis Agiomyrgiannakis; Ying Xiao

arXiv: 1703.10135 · v2 · pith:KRV4RSGWnew · submitted 2017-03-29 · 💻 cs.CL · cs.LG · cs.SD

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

classification 💻 cs.CL cs.LG cs.SD

keywords tacotron model speech synthesis audio end-to-end system text

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis
eess.AS 2019-06 unverdicted novelty 7.0

RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
eess.AS 2026-04 unverdicted novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech
cs.SD 2026-04 unverdicted novelty 6.0

A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
cs.LG 2025-12 unverdicted novelty 6.0

Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.
Step-Audio 2 Technical Report
cs.CL 2025-07 unverdicted novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
cs.CV 2025-06 unverdicted novelty 6.0

JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.
Script2Screen: Supporting Dialogue Scriptwriting with Interactive Audiovisual Generation
cs.HC 2025-04 unverdicted novelty 6.0

Script2Screen integrates scriptwriting with an interactive text-to-audiovisual pipeline for dialogues, using a user study to show it supports iterative refinement in creative writing.
DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis
eess.AS 2019-07 unverdicted novelty 6.0

Two new embedding algorithms (similarity vector prediction and Frobenius-norm matrix matching) trained on subjective inter-speaker scores yield d-vectors more correlated with human similarity judgments and improve TTS...
Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling
cs.SD 2019-06 unverdicted novelty 5.0

Deep autoregressive models with F0 discretization, post-processing, and self-attention prenet outperform RNNs in objective and subjective metrics for singing voice synthesis on a Chinese corpus.
Character-Centered Dialogue Generation from Scene-Level Prompts
cs.CV 2025-05 unverdicted novelty 4.0

A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.
Hierarchical Sequence to Sequence Voice Conversion with Limited Data
eess.AS 2019-07 unverdicted novelty 4.0

A hierarchical seq2seq model for parallel voice conversion, pretrained as an autoencoder on single-speaker data and then adapted to limited multi-speaker data, with mel spectrograms converted to audio by a WaveNet vocoder.
Improving Performance of End-to-End ASR on Numeric Sequences
eess.AS 2019-07 unverdicted novelty 4.0

TTS-generated numeric training data plus a compact neural denormalizer improve E2E ASR word error rates on numeric sequences by up to a factor of 8 for the longest cases.
Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
eess.AS 2026-04 unverdicted novelty 3.0

Voice range indicates TTS model capability with VITS highest, Glow-TTS best at soft phonation, and CPPs of 7-8 dB marking natural quality while values over 10 dB sound robotic.
Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
cs.CL 2026-05 unverdicted novelty 2.0

A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.