FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

2026 · cs.SD · arXiv 2606.19209

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling global style. We propose FineCombo-TTS, a unified framework for speech synthesis grounded in reference speech and guided by text descriptions, enabling flexible and precise control over acoustic attributes. Instead of explicit attribute disentanglement, we learn a unified acoustic representation and introduce a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations guided by text descriptions. To support relative attribute control, we construct FineEdit, a structured paired dataset that explicitly encodes source-to-target attribute variations. Experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS.
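The abstract names Conditional Flow Matching (CFM) but does not spell out the objective. As a rough illustration only, the sketch below shows the generic CFM regression target with the commonly used straight-line (optimal-transport) probability path. It is not taken from the paper: the function names, the `sigma_min` value, and the idea that `cond` would bundle reference-speech and text-description embeddings are all assumptions made for this example.

```python
import numpy as np

def cfm_training_pair(x0, x1, t, sigma_min=1e-4):
    """Build (x_t, u_t) for the straight-line conditional flow matching path.

    x0: noise samples, shape (batch, dim)
    x1: target features, shape (batch, dim) (hypothetical variance features)
    t:  times in [0, 1], shape (batch,)
    """
    # Reshape t so it broadcasts over the feature dimensions.
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x0.ndim - 1)))
    # Point on the path between noise and target at time t.
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity of that path; it is constant in t for this choice of path.
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t

def cfm_loss(predict_velocity, x0, x1, t, cond, sigma_min=1e-4):
    """Mean squared error between predicted and target velocity.

    predict_velocity is a hypothetical callable (x_t, t, cond) -> velocity;
    cond stands in for whatever conditioning the model receives.
    """
    x_t, u_t = cfm_training_pair(x0, x1, t, sigma_min)
    v = predict_velocity(x_t, t, cond)
    return float(np.mean((v - u_t) ** 2))
```

In this formulation the network only has to regress a velocity field, and sampling integrates that field from noise toward the target. Whether FineCombo-TTS uses this exact path or conditioning scheme is not stated in the abstract.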

representative citing papers

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

cs.SD · 2026-06-17 · unverdicted · novelty 6.0

FineCombo-TTS learns a unified acoustic representation and uses a CFM-based Speech Variance Predictor to give flexible, precise TTS control from reference audio and text descriptions, supported by the new FineEdit paired dataset.

