TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Framework for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Ban Ma-bao; Cheng Huang; Fan Gao; Nyima Tashi; Renzeng Duojie; Xiangxiang Wang; Yongbin Yu; Yuqing Cai; Yutong Liu; Ziyue Zhang

arxiv: 2509.18060 · v2 · submitted 2025-09-22 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 14:21 UTC · model grok-4.3

keywords Tibetan multi-dialect TTS · text-to-speech synthesis · dialect conversion · low-resource speech modeling · DSDR-Net · speech-to-speech dialect conversion · U-Tsang, Amdo, Kham

The pith

A unified TTS model generates parallel speech for three Tibetan dialects using explicit labels and specialized routing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tibetan lacks large parallel speech corpora across its U-Tsang, Amdo, and Kham dialects, which restricts progress in speech technology for the language. TMD-TTS solves this by training one text-to-speech system that accepts text plus a dialect label and produces matching speech in any of the three dialects. The design adds a dialect fusion module and a Dialect-Specialized Dynamic Routing Network to model the distinct sounds and word choices of each dialect while sharing most parameters. Objective and subjective tests show the outputs are more expressive of dialect identity than those from baseline systems. The same synthetic speech is then used in a speech-to-speech dialect conversion task to confirm its practical value.

Core claim

TMD-TTS is a unified multi-dialect TTS framework that synthesizes parallel dialectal speech from explicit dialect labels by means of a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) that together capture fine-grained acoustic and linguistic variations across the U-Tsang, Amdo, and Kham dialects.

What carries the argument

Dialect fusion module together with the Dialect-Specialized Dynamic Routing Network (DSDR-Net), which dynamically routes dialect-specific information while keeping a shared model backbone.
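The routing idea can be illustrated with a toy layer. This is a minimal sketch, not the paper's architecture: the class `RoutedLayer`, the additive "fusion" step, the residual connection, and all dimensions are assumptions made for illustration. The only detail taken from the page is that a dialect ID selects a dialect-specific sub-network while the rest of the model is shared.

```python
import random

DIALECTS = ("U-Tsang", "Amdo", "Kham")  # dialect ID = index into this tuple


def make_linear(n_in, n_out, rng):
    """Return a random n_out x n_in weight matrix as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class RoutedLayer:
    """Hypothetical layer: shared projection plus one branch per dialect."""

    def __init__(self, dim, n_dialects, seed=0):
        rng = random.Random(seed)
        self.shared = make_linear(dim, dim, rng)  # used by every dialect
        self.experts = [make_linear(dim, dim, rng) for _ in range(n_dialects)]
        # Label embedding table, one vector per dialect (assumed fusion input).
        self.dialect_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                            for _ in range(n_dialects)]

    def forward(self, x, dialect_id):
        # Fusion stand-in: add the dialect embedding to the text feature.
        fused = [a + b for a, b in zip(x, self.dialect_emb[dialect_id])]
        hidden = apply_linear(self.shared, fused)
        # Hard routing: only the branch selected by the dialect ID runs.
        routed = apply_linear(self.experts[dialect_id], hidden)
        # Residual sum keeps the shared representation in the output.
        return [h + r for h, r in zip(hidden, routed)]


layer = RoutedLayer(dim=4, n_dialects=len(DIALECTS))
text_feature = [0.1, -0.2, 0.3, 0.05]  # made-up encoder output for one token
outputs = {name: layer.forward(text_feature, i) for i, name in enumerate(DIALECTS)}
```

The same `text_feature` yields a different output vector per dialect label, which is the property that would let one model emit parallel speech for all three dialects. A real system would learn these weights and might use soft rather than hard routing.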

If this is right

Produces synthetic parallel speech corpora across the three Tibetan dialects from a single trained model.
Outperforms the baseline TTS systems it is compared against on measures of dialectal expressiveness.
Supplies usable training material for the Speech-to-Speech Dialect Conversion task.
Lowers the data-collection burden for building multi-dialect speech systems in low-resource settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same label-plus-routing pattern could be tested on other languages that have multiple dialects but little parallel audio.
Generated dialect-specific speech could feed into dialect-aware speech recognition or machine translation pipelines.
The framework might allow few-shot addition of a new Tibetan dialect if a small amount of labeled data becomes available later.
Cultural and educational applications could use the outputs to create dialect-matched audio content without new studio recordings.

Load-bearing premise

Explicit dialect labels plus the fusion module and DSDR-Net are sufficient to capture the acoustic and linguistic differences among the three dialects without extra dialect-specific recordings or major architecture changes.

What would settle it

A blind listening test in which native Tibetan speakers cannot reliably identify which dialect the TMD-TTS output is intended to represent, or where speech-to-speech dialect conversion accuracy using the synthetic data shows no gain over baselines trained on limited real data.
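One way to make "cannot reliably identify" concrete is an exact one-sided binomial test of listener accuracy against uniform guessing among three dialects. The sketch below is an assumption about how such a test could be scored, not a procedure from the paper; the function names, the 0.05 threshold, and the trial counts are all hypothetical.

```python
from math import comb


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))


def identifies_above_chance(correct, trials, n_dialects=3, alpha=0.05):
    """Exact one-sided test of listener accuracy against uniform guessing.

    Returns (p_value, significant). Chance level is 1 / n_dialects.
    """
    p_value = binomial_tail(correct, trials, 1.0 / n_dialects)
    return p_value, p_value < alpha


# Hypothetical counts, for illustration only: 30 clips per condition.
near_chance = identifies_above_chance(correct=10, trials=30)
clear_signal = identifies_above_chance(correct=25, trials=30)
print("10/30 correct: significant =", near_chance[1])
print("25/30 correct: significant =", clear_signal[1])
```

With three dialects, 10 correct out of 30 sits at the chance rate and would count against the claim, while 25 out of 30 would support it. A real study would also need to account for repeated measures per listener.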

Original abstract

Tibetan is a low-resource language with limited parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose TMD-TTS, a unified Tibetan multi-dialect text-to-speech (TTS) framework that synthesizes parallel dialectal speech from explicit dialect labels. Our method features a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects. Extensive objective and subjective evaluations demonstrate that TMD-TTS significantly outperforms baselines in dialectal expressiveness. We further validate the quality and utility of the synthesized speech through a challenging Speech-to-Speech Dialect Conversion (S2SDC) task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TMD-TTS applies dialect labels, a fusion module, and DSDR-Net to generate Tibetan multi-dialect speech and tests it with an S2SDC task, but the abstract supplies too few experimental details to judge whether the gains are dialect-specific.

TMD-TTS is a practical engineering effort to synthesize parallel speech in U-Tsang, Amdo, and Kham using explicit dialect labels plus a fusion module and DSDR-Net. The abstract also introduces an S2SDC validation task to check the output quality. That combination for Tibetan is the main new element relative to the prior multi-dialect TTS work mentioned, and the paper does a reasonable job laying out the data-scarcity problem and offering a unified framework that could help other low-resource settings with dialect variation.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes TMD-TTS, a unified TTS framework for synthesizing speech across three Tibetan dialects (U-Tsang, Amdo, Kham) from text and explicit dialect labels. It introduces a dialect fusion module and DSDR-Net to capture fine-grained acoustic and linguistic variations, reports that the system significantly outperforms baselines on objective and subjective measures of dialectal expressiveness, and validates the outputs via a Speech-to-Speech Dialect Conversion (S2SDC) task.

Significance. If the reported gains are shown to arise specifically from dialect-conditioned modeling rather than generic multi-speaker capacity, the work would provide a practical route to parallel multi-dialect corpora for a low-resource language family. The S2SDC validation, if rigorously controlled, could also serve as a useful extrinsic test for dialect fidelity in TTS systems.

major comments (3)

[Abstract] Abstract: the claim that 'extensive objective and subjective evaluations demonstrate that TMD-TTS significantly outperforms baselines in dialectal expressiveness' is presented without any numerical results, dataset sizes, speaker counts, evaluation metrics, or statistical tests. This information is load-bearing for the central empirical claim.
[Method / Experiments] Method and Experiments sections: the paper does not provide ablation studies or analysis showing that performance improvements require the combination of explicit dialect labels, fusion module, and DSDR-Net rather than arising from increased model capacity or speaker averaging. Without such controls, it remains possible that gains are not dialect-specific.
[Experiments / S2SDC] S2SDC validation: the task is described as 'challenging' evidence of utility, yet no details are given on conversion accuracy metrics, baseline systems for the conversion task, or controls for speaker/recording variation. This leaves the validation of synthesized-speech quality unsupported.

minor comments (2)

[Title] The title contains an extraneous quotation mark before 'U-Tsang'.
[Throughout] Ensure consistent transliteration of dialect names (U-Tsang / Ü-Tsang) and add citations to prior Tibetan or low-resource multi-dialect TTS work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below, indicating the revisions we plan to make.

Referee: [Abstract] Abstract: the claim that 'extensive objective and subjective evaluations demonstrate that TMD-TTS significantly outperforms baselines in dialectal expressiveness' is presented without any numerical results, dataset sizes, speaker counts, evaluation metrics, or statistical tests. This information is load-bearing for the central empirical claim.

Authors: We agree that including specific numerical results in the abstract would strengthen the presentation of our central claim. In the revised manuscript, we will update the abstract to incorporate key quantitative findings, such as the reported improvements in objective metrics (e.g., MCD, F0 RMSE) and subjective MOS scores for dialectal expressiveness, along with dataset statistics including the number of speakers and total hours per dialect. This will provide readers with immediate evidence supporting the claim. revision: yes
Referee: [Method / Experiments] Method and Experiments sections: the paper does not provide ablation studies or analysis showing that performance improvements require the combination of explicit dialect labels, fusion module, and DSDR-Net rather than arising from increased model capacity or speaker averaging. Without such controls, it remains possible that gains are not dialect-specific.

Authors: This is a valid concern. While our experiments include comparisons to multi-speaker baselines without explicit dialect conditioning, we acknowledge the need for more targeted ablations. We will add ablation studies in the revised version that systematically remove the dialect fusion module and DSDR-Net components, as well as variants without explicit dialect labels, to demonstrate that the performance gains are specifically attributable to the dialect-aware modeling rather than general increases in model capacity. revision: yes
Referee: [Experiments / S2SDC] S2SDC validation: the task is described as 'challenging' evidence of utility, yet no details are given on conversion accuracy metrics, baseline systems for the conversion task, or controls for speaker/recording variation. This leaves the validation of synthesized-speech quality unsupported.

Authors: We appreciate this feedback on the S2SDC section. To address the lack of details, we will expand the description of the S2SDC task in the revised manuscript. This will include specific metrics used for conversion accuracy (such as dialect classification accuracy and perceptual similarity scores), comparisons against baseline dialect conversion systems, and explanations of controls implemented to account for speaker and recording variations in our dataset. These additions will provide a more rigorous validation of the synthesized speech quality and utility. revision: yes
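For readers unfamiliar with the objective metrics named in the first response, the sketch below shows the conventional definitions of mel-cepstral distortion (MCD) and F0 RMSE. It is an illustration under stated assumptions, not the authors' evaluation code: frames are assumed to be already time-aligned (for example by dynamic time warping, not shown), the 0th cepstral coefficient is excluded, and unvoiced frames are marked with F0 = 0. The function names are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over aligned frames, skipping the 0th (energy) coefficient."""
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("frames must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both contours."""
    pairs = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not pairs:
        raise ValueError("no frames voiced in both contours")
    return math.sqrt(sum((r - s) ** 2 for r, s in pairs) / len(pairs))


# Made-up values, for illustration only.
reference = [[0.0, 1.0, 0.5], [0.0, 0.8, 0.4]]
synthetic = [[0.0, 0.9, 0.5], [0.0, 0.8, 0.1]]
print("MCD (dB):", mel_cepstral_distortion(reference, synthetic))
print("F0 RMSE (Hz):", f0_rmse([100.0, 0.0, 200.0], [110.0, 120.0, 190.0]))
```

Lower is better for both metrics. Neither measures dialect identity directly, which is why the referee's request for dialect-specific evidence is a separate point.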

Circularity Check

0 steps flagged

No circularity: TMD-TTS is an empirical engineering proposal validated by evaluations

The paper proposes an applied TTS architecture using explicit dialect labels, a fusion module, and DSDR-Net to synthesize multi-dialect Tibetan speech from limited corpora. Claims rest on objective/subjective metrics and an S2SDC task rather than any derivation chain. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear; the work is self-contained as an empirical contribution and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the modeling assumption that dialect identity can be injected as an explicit label and routed through specialized sub-networks; no free parameters, axioms, or invented entities are explicitly listed in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: TMD-TTS ... features a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: DSDR-Net ... dynamically routes information to a dialect-specific sub-network based on the input dialect ID.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
cs.SD · 2026-05 · unverdicted · novelty 7.0

Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] INTRODUCTION. Tibetan, spoken by over six million people across Tibet, neighboring Chinese provinces, and parts of South Asia, is a low-resource language with three major dialects (Ü-Tsang, Amdo, and Kham) that differ substantially in phonology, lexicon, and syntax, often resulting in limited mutual intelligibility. To facilitate cross-dialect communic...

[2] METHODS. In this work, we present TMD-TTS, a unified multi-dialect TTS model built upon Matcha-TTS [15], specifically designed for generating high-quality Tibetan dialect speech datasets. Our system addresses the limitations of prior approaches [17], such as the reliance ... Fig. 1: Overall framework of the proposed m...

[3] EXPERIMENTS. 3.1. Dataset and Evaluation. Dataset: We constructed a 179-hour multi-dialect Tibetan speech corpus, including 44h Ü-Tsang, 45h Kham, and 90h Amdo from 1,500+ speakers. The training set contains 40k samples per dialect, with 300 samples each for validation and test. Setting: TMD-TTS was trained for 500k steps with Adam [19], and the vocoder follow...

[4] CONCLUSION. In this work, we propose TMD-TTS, a unified Tibetan multi-dialect TTS model that incorporates dialect representations for multi-dialect Tibetan speech synthesis. We design a dialect fusion module and introduce DSDR-Net to better control dialectal variations. Leveraging this model, we construct and release a large-scale, parallel Tibetan multi...

[5] Renzeng Zhuoma and Liping Zhu, “A dataset of tibetan dialect speech synthesis,” China Scientific Data, vol. 7, no. 2, pp. 20–29, 6 2022

[6] J. Kim, J. Kong, and J. Son, “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech,” in International Conference on Machine Learning, 2021

[7] Jaehyeon Kong, Jiyeon Park, Bongjung Kim, Jeongmin Kim, Doyeon Kong, and Seungwon Kim, “VITS 2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2023

[8] Kihyuk Sohn, Xinchen Yan, and Honglak Lee, “Learning structured output representation using deep conditional generative models,” in Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2, Cambridge, MA, USA, 2015, NIPS’15, p. 3483–3491, MIT Press

[9] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan, “Normalizing flows for probabilistic modeling and inference,” 2021

[10] X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He, F. Soong, T. Qin, S. Zhao, and T.-Y. Liu, “NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4234–4245, 2024

[11] K. Shen, Z. Ju, X. Tan, Y. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian, “NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers,” arXiv:2304.09116, 2023

[12] Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” 2024

[13] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions,” in International Conference on Acoustics, Speech, and Signal Processing, 2018, pp. 4779–4783

[14] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech,” in Proceedings of the International Conference on Learning Representations, 2020

[15] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” 2020

[16] Sang gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon, “Bigvgan: A universal neural vocoder with large-scale training,” 2023

[17] Takashi Shibuya, Yuhta Takida, and Yuki Mitsufuji, “Bigvsan: Enhancing gan-based neural vocoders with slicing adversarial network,” 2024

[18] Hubert Siuzdak, “Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,” 2024

[19] Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter, “Matcha-tts: A fast tts architecture with conditional flow matching,” 2024

[20] William Peebles and Saining Xie, “Scalable diffusion models with transformers,” 2023

[21] Xiaona Xu, Li Yang, Yue Zhao, and Hui Wang, “End-to-end speech synthesis for tibetan multidialect,” Complexity, vol. 2021, no. 1, 2021

[22] Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, and Yu Tsao, “Metricgan+: An improved version of metricgan for speech enhancement,” in Interspeech 2021, 2021

[23] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” 2017

[24] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2017

[25] H. Yoon, C. Kim, S. Um, H.-W. Yoon, and H.-G. Kang, “SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems,” IEEE Signal Processing Letters, vol. 30, pp. 593–597, 2023

[26] Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011

[27] Antony W Rix, Jan G Beerends, Michael P Hollier, and Andries P Hekstra, “Perceptual evaluation of speech quality (pesq)–a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2001, vol. 2, pp. 749–752

[28] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R Hershey, “Sdr–half-baked or well done?,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019, pp. 626–630

[29] Anmol Mittal, Sriram Lohit, Yi Xu, Chandan Li, and Yash Rao, “The microsoft deep noise suppression challenge: Dataset, subjective evaluation, and challenge results,” arXiv preprint arXiv:2008.04848, 2020

[30] Hyung-Seok Oh, Sang-Hoon Lee, Deok-Hyeon Cho, and Seong-Whan Lee, “Durflex-evc: Duration-flexible emotional voice conversion leveraging discrete representations without text alignment,” IEEE Transactions on Affective Computing, p. 1–15, 2025

[31] Yichao Cai, Yuhang Liu, Zhen Zhang, and Javen Qinfeng Shi, “Clap: Isolating content from style through contrastive learning with augmented prompts,” in Computer Vision – ECCV 2024, Cham, 2025, pp. 130–147, Springer Nature Switzerland
