A block-wise generation architecture with progressive depth-wise decoding on 32-layer RVQ codes from the Mimi codec delivers 48.99 ms time-to-first-byte latency and improved voicing accuracy over regression-based TTS.
Nevertheless, the cascaded vocoder remains a major impediment to pure streaming efficiency.
An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding