Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

2026 · eess.AS · arXiv 2604.19330

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

Recent advances in Text-To-Speech (TTS) synthesis have popularized multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of-Details (CoD), a novel framework that explicitly models temporal coarse-to-fine dynamics in speech generation using a cascaded architecture. Our method progressively refines temporal details across multiple stages, with each stage targeting a specific temporal granularity. All temporal detail predictions are performed using a shared decoder, enabling efficient parameter utilization across different temporal resolutions. Notably, we observe that the lowest detail level naturally performs phonetic planning without the need for an explicit phoneme duration predictor. We evaluate our method on several datasets and compare it against several baselines. Experimental results show that CoD achieves competitive performance with significantly fewer parameters than existing approaches. Our findings demonstrate that explicit modeling of temporal dynamics with the CoD framework leads to more natural speech synthesis.

representative citing papers

EnvTriCascade: An Environment-Aware Tri-Stage Cascaded Framework for ESDD2 2026 Challenge

cs.SD · 2026-05-18 · unverdicted · novelty 4.0

EnvTriCascade is a tri-stage cascaded framework using mix-consistency detection followed by dual SSL-based five-class classifiers with cross-branch attention and RawBoost augmentation, achieving 0.8266 Macro-F1 on the ESDD2 2026 challenge test set.

citing papers explorer

Showing 1 of 1 citing paper.

EnvTriCascade: An Environment-Aware Tri-Stage Cascaded Framework for ESDD2 2026 Challenge
cs.SD · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
