DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Daxin Tan; Dehua Tao; Hanlin Zhang; Haochen Tan; Linqi Song; Xiao Chen; Yuchen Cao; Yunhe Li

arxiv: 2601.09239 · v6 · pith:7X4N3VUOnew · submitted 2026-01-14 · 💻 cs.SD · cs.AI · eess.AS


Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Linqi Song

classification 💻 cs.SD · cs.AI · eess.AS

keywords semantic · speech · acoustic · disentanglement · dsa-tokenizer · semantic-acoustic · tokens · achieve


Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to 4-step inference and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/
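
For readers unfamiliar with Flow Matching, the sketch below illustrates the generic idea behind a flow-matching decoder with few-step sampling. It is not the paper's decoder: the abstract gives no equations, so the straight-line noise-to-data path, the Euler sampler, the 80-bin mel shape, and the names `cfm_training_pair`, `euler_sample`, and `oracle_velocity` are all assumptions made for illustration. The token conditioning, the DiT network, the distillation step, and the GAN fine-tuning are omitted.

```python
import numpy as np

def cfm_training_pair(x1, rng, t=None):
    """Build one flow-matching training example (hypothetical helper).

    x1 is a target mel-spectrogram of shape (frames, bins). A noise sample x0
    is drawn, a point x_t is taken on the straight line from x0 to x1, and the
    regression target is the constant velocity x1 - x0 along that line.
    """
    x0 = rng.standard_normal(x1.shape)
    if t is None:
        t = rng.uniform()
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, t, v_target

def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy demo: an "oracle" velocity that points straight at a known target.
# A real decoder would replace this with a trained network conditioned on
# semantic and acoustic tokens (an assumption, not shown here).
rng = np.random.default_rng(0)
target_mel = rng.standard_normal((50, 80))  # 50 frames, 80 mel bins (assumed)

def oracle_velocity(x, t):
    return (target_mel - x) / (1.0 - t)

noise = rng.standard_normal(target_mel.shape)
recovered = euler_sample(oracle_velocity, noise, num_steps=4)
```

With a straight-line path the ideal velocity is constant along each trajectory, which is one intuition for why a small number of integration steps can suffice; how the authors reach 4-step inference through distillation is described in the paper itself, not here.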

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

X-VC: Zero-shot Streaming Voice Conversion in Codec Space
eess.AS 2026-04 unverdicted novelty 7.0

X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion, using a dual-conditioning acoustic converter and role-assignment training on generated paired data.