DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion
read the original abstract
Speech tokenizers are a key building block of fully discrete Speech LLMs.Existing tokenizers either prioritize semantic encoding,fuse semantic content with acoustic style inseparably,or achieve incomplete semantic-acoustic disentanglement.To achieve better disentanglement,we propose DSA-Tokenizer,which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints.Specifically,semantic tokens are supervised by ASR to capture linguistic content,while acoustic tokens focus on mel-spectrograms restoration to encode style.We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy,allowing the model to support both high-fidelity reconstruction and cross-utterance voice clone.To speed up inference,we distill the DiT decoder to reduce sampling steps of inference to 4 and improve synthesis quality with GAN fine-tuning.Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement,reliable controllable voice cloning,and efficient high-fidelity generation with low WER/CER.Moreover,our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation.Audio samples are avaialble at https://anonymous.4open.science/w/DSA_Tokenizer_demo/.
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
-
X-VC: Zero-shot Streaming Voice Conversion in Codec Space
X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.