arxiv: 2605.30748 · v1 · pith:IPGJGW7Mnew · submitted 2026-05-29 · cs.SD · cs.AI · eess.AS

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

classification cs.SD cs.AI eess.AS
keywords chatterbox-flash streaming token zero-shot autoregressive block block-diffusion decoder

We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, as a long-tail token distribution biases parallel position selection toward a few high-frequency tokens. To mitigate this without architectural modification, we introduce two inference-time techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively terminates iteration based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash attains high-fidelity synthesis comparable to strong autoregressive and non-autoregressive baselines, while supporting streaming inference with time-to-first-packet on par with streaming AR systems and substantially lower real-time factor. Code and audio samples are available at https://github.com/resemble-ai/chatterbox-flash.
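
The abstract describes prior-calibrated scoring only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the "block-level marginal" is the mean token distribution over the positions of a block, that the subtraction happens in the log domain, and that the early-decoding rule commits every remaining position once all calibrated confidences clear a threshold. The names `block_marginal`, `calibrated_confidence`, `pick_positions`, and `stop_threshold` are hypothetical and do not come from the paper or its repository.

```python
import math

EPS = 1e-12  # guards against log(0); an assumption of this sketch


def block_marginal(probs):
    """Mean token distribution over all positions of one block (assumed definition)."""
    n = len(probs)
    vocab = len(probs[0])
    return [sum(row[v] for row in probs) / n for v in range(vocab)]


def raw_confidence(probs):
    """Per position: (argmax token, log-probability of that token)."""
    out = []
    for row in probs:
        token = max(range(len(row)), key=row.__getitem__)
        out.append((token, math.log(row[token] + EPS)))
    return out


def calibrated_confidence(probs):
    """Per position: (argmax token, log p(token) minus log block marginal of that token)."""
    prior = block_marginal(probs)
    out = []
    for row in probs:
        token = max(range(len(row)), key=row.__getitem__)
        score = math.log(row[token] + EPS) - math.log(prior[token] + EPS)
        out.append((token, score))
    return out


def pick_positions(scored, k, stop_threshold=None):
    """Choose which positions to commit in this iteration.

    Hypothetical early-decoding rule: if every position already meets
    stop_threshold, commit the whole block; otherwise commit the top-k.
    """
    if stop_threshold is not None and all(s >= stop_threshold for _, s in scored):
        return list(range(len(scored)))
    order = sorted(range(len(scored)), key=lambda i: scored[i][1], reverse=True)
    return sorted(order[:k])


# Toy block: 3 positions, vocabulary of 3 tokens; token 0 is the frequent one.
probs = [
    [0.70, 0.20, 0.10],
    [0.70, 0.20, 0.10],
    [0.40, 0.05, 0.55],
]

raw_pick = pick_positions(raw_confidence(probs), k=1)
cal_pick = pick_positions(calibrated_confidence(probs), k=1)
print(raw_pick)  # [0]: raw confidence favors the frequent token
print(cal_pick)  # [2]: calibration favors the position that beats the block prior
```

In this toy block, raw confidence ranks the positions predicting the frequent token first, while the calibrated score ranks the position whose prediction stands out most against the block marginal. That matches the selection bias the abstract attributes to long-tail token distributions, although the paper's actual scoring and schedule may differ in detail.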
