arxiv: 2505.14351 · v4 · submitted 2025-05-20 · cs.SD · cs.AI · cs.CL · eess.AS

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi

keywords speech · fmsd-tts · dialect · few-shot · multi-dialect · speaker · tibetan · amdo

Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
cs.SD · 2026-05 · unverdicted · novelty 7.0

Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.