MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts

Heyang Xue; Jianyu Chen; Xuchen Song; Yahui Zhou; Yang Li; Yanru Chen; Yu Tang

arXiv: 2508.11326 · v1 · pith:5NDYR6CPnew · submitted 2025-08-15 · eess.AS · cs.SD

keywords: moe-tts, text, descriptions, out-of-domain, understanding, description-based, approach, designed

Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. However, in real-world applications, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality while maintaining the original LLM frozen during training. This approach allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions. We encourage readers to listen to the demos at https://welkinyang.github.io/MoE-TTS/.
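
The modality-based routing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the routing rule or the layer types, so the sketch assumes hard routing by a per-token modality tag and uses a single linear map as a stand-in for each expert. The names `LinearExpert`, `modality_moe_layer`, and `sgd_step` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[Vector]


@dataclass
class LinearExpert:
    """Toy expert computing y = W x (a stand-in for a transformer sub-layer)."""
    weight: Matrix
    trainable: bool

    def forward(self, x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]

    def sgd_step(self, grad: Matrix, lr: float) -> None:
        # Frozen experts ignore updates, mirroring the frozen text LLM.
        if not self.trainable:
            return
        self.weight = [
            [w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(self.weight, grad)
        ]


def modality_moe_layer(
    tokens: List[Vector],
    modalities: List[str],
    text_expert: LinearExpert,
    speech_expert: LinearExpert,
) -> List[Vector]:
    """Assumed hard routing: each token goes to the expert for its modality."""
    experts = {"text": text_expert, "speech": speech_expert}
    return [experts[m].forward(x) for x, m in zip(tokens, modalities)]


# Frozen "pre-trained" text weights and trainable speech-specific weights.
text_expert = LinearExpert([[1.0, 0.0], [0.0, 1.0]], trainable=False)
speech_expert = LinearExpert([[2.0, 0.0], [0.0, 2.0]], trainable=True)

out = modality_moe_layer(
    tokens=[[1.0, 2.0], [3.0, 4.0]],
    modalities=["text", "speech"],
    text_expert=text_expert,
    speech_expert=speech_expert,
)

# Apply the same dummy gradient to both experts; only the speech expert moves.
dummy_grad = [[1.0, 1.0], [1.0, 1.0]]
text_expert.sgd_step(dummy_grad, lr=0.5)
speech_expert.sgd_step(dummy_grad, lr=0.5)
```

In this sketch the text token passes through the untouched text weights while the speech token uses its own weights, and the update step leaves the text expert unchanged. That is the property the abstract relies on to preserve the LLM's text understanding; how MoE-TTS realizes it inside a full transformer is described in the paper itself.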

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts
cs.SD · 2026-05 · unverdicted · novelty 7.0

PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.