DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Daxin Tan; Dehua Tao; Hanlin Zhang; Haochen Tan; Linqi Song; Xiao Chen; Yuchen Cao; Yunhe Li

arxiv: 2601.09239 · v4 · pith:7X4N3VUOnew · submitted 2026-01-14 · 💻 cs.SD · cs.AI · eess.AS

Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Linqi Song

classification 💻 cs.SD cs.AI eess.AS

keywords semantic speech acoustic disentanglement dsa-tokenizer semantic-acoustic tokens achieve

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to reduce the number of sampling steps to 4 and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/.
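
To make the "4 sampling steps" claim concrete, here is a minimal sketch of few-step sampling in a flow matching model: the decoder's velocity field is integrated from t=0 (noise) to t=1 (data) with a fixed-step Euler solver. This is not the paper's implementation. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the toy velocity field stands in for the distilled DiT decoder, which in the paper would be conditioned on semantic and acoustic tokens.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 4,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    `num_steps` is the number of velocity evaluations; 4 mirrors the
    step count quoted in the abstract.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t takes the values 0, dt, ..., 1 - dt (never exactly 1)
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned network (assumption, not the paper's model).

    Returns the velocity of the straight-line path that ends at `target`:
    v(x, t) = (target - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]

    return velocity


# Start from a fixed "noise" vector and transport it to a toy "mel frame".
noise = [0.5, -1.0, 2.0]
target = [1.0, 0.0, -1.0]
sample = euler_sample(make_toy_velocity(target), noise, num_steps=4)
print(sample)
```

Because the toy field follows a perfectly straight path, four Euler steps land on the target up to floating-point error. A learned field is generally curved, which is why a distillation stage such as the one described in the abstract is typically needed before so few steps give good quality.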

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

X-VC: Zero-shot Streaming Voice Conversion in Codec Space
eess.AS 2026-04 unverdicted novelty 7.0

X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion, using a dual-conditioning acoustic converter and role-assignment training on generated paired data.