SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

2026 · cs.SD · arXiv 2606.11611

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This effectively mitigates the dilemma, creating an efficient and compact latent space without relying on complex regularizers. SARA achieves superior reconstruction quality over strong baselines. Furthermore, in downstream zero-shot TTS tasks, it yields highly natural and expressive synthesis quality, and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis speed and computational cost.
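The dual-stream idea in the abstract can be illustrated with a toy sketch. Nothing below is taken from the paper: the class name `DualStreamEncoder`, the additive fusion of the two streams, the random linear projections standing in for the SSL model and the acoustic encoder, and the plain KL term are all assumptions chosen only to show how a fixed semantic anchor and a learned residual might share one VAE latent.

```python
import math
import random


def make_weights(n_in, n_out, rng):
    """Small random matrix with n_in rows and n_out columns."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]


def linear(x, w):
    """Multiply vector x (length n_in) by matrix w (n_in x n_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


class DualStreamEncoder:
    """Hypothetical toy encoder: frozen semantic anchor plus acoustic residual."""

    def __init__(self, n_feat, n_latent, seed=0):
        rng = random.Random(seed)
        # Stand-in for a frozen SSL model: these weights are never updated.
        self.w_semantic = make_weights(n_feat, n_latent, rng)
        # Stand-in for the trainable residual acoustic encoder.
        self.w_residual = make_weights(n_feat, n_latent, rng)
        self.w_logvar = make_weights(n_feat, n_latent, rng)

    def encode(self, frame):
        anchor = linear(frame, self.w_semantic)
        residual = linear(frame, self.w_residual)
        # Assumed fusion rule: the residual is added to the semantic anchor.
        mean = [a + r for a, r in zip(anchor, residual)]
        logvar = linear(frame, self.w_logvar)
        return mean, logvar


def reparameterize(mean, logvar, rng):
    """Sample z = mean + sigma * eps with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]


def kl_to_standard_normal(mean, logvar):
    """KL divergence from a diagonal Gaussian to N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mean, logvar))


encoder = DualStreamEncoder(n_feat=4, n_latent=2)
mean, logvar = encoder.encode([0.1, 0.2, 0.3, 0.4])
z = reparameterize(mean, logvar, random.Random(1))
```

In this sketch both streams read the same input frame for brevity. A real system would presumably feed SSL features to the semantic stream and waveform or spectrogram features to the acoustic stream.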

representative citing papers

SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

cs.SD · 2026-06-10 · unverdicted · novelty 4.0

SARA is a dual-stream VAE that integrates semantic and acoustic streams to achieve high-fidelity reconstruction and natural zero-shot TTS without complex regularizers.
