OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion

Junlan Feng; Shilei Zhang; Tao Li; Wenshuo Ge; Zhichao Wang; Zihao Cui

arxiv: 2601.18094 · v2 · pith:BHLWW5DOnew · submitted 2026-01-26 · 📡 eess.AS · cs.SD

Pith reviewed 2026-05-22 11:32 UTC · model grok-4.3

keywords: voice conversion, zero-shot, mixture of experts, unified model, speech synthesis, singing voice conversion, prosody conditioning

The pith

A single zero-shot model unifies linguistic-preserving, expressive, and singing voice conversion without trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OneVoice as a unified framework that performs zero-shot voice conversion across three scenarios—linguistic-preserving speech, expressive speech, and singing—using one model instead of specialized systems. It relies on a continuous language model trained via VAE-free next-patch diffusion and introduces a Mixture-of-Experts architecture with dual-path routing to separate shared conversion knowledge from scenario-specific expressivity. A two-stage progressive training process, including foundational pre-training followed by LoRA-based domain expert enhancement, addresses the imbalance between abundant speech data and scarce singing data. If the approach holds, practitioners could replace multiple dedicated models with a single flexible system that supports high-fidelity output and rapid decoding.

Core claim

OneVoice achieves performance that matches or surpasses specialized models across linguistic-preserving, expressive, and singing voice conversion by combining a Mixture-of-Experts design with dual-path routing for shared and scenario-aware experts, gated fusion of scenario-specific prosodic features at every layer, and a two-stage training regime that uses LoRA-based domain experts to mitigate data imbalance while preserving a fast 2-step decoding option.

What carries the argument

Mixture-of-Experts with dual-path routing (shared expert isolation plus scenario-aware domain expert assignment using global-local cues) that explicitly separates shared conversion knowledge from scenario-specific expressivity, augmented by gated per-layer fusion of prosodic features.
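The routing idea described above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: each expert is reduced to a single linear map, the router is a linear layer followed by a softmax, the "global-local cues" are modelled by concatenating a per-frame hidden state with a scenario embedding, and the names `DualPathMoE` and `gated_fusion` are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class DualPathMoE:
    """Toy layer: one always-on shared expert plus top-k routed domain experts."""

    def __init__(self, dim, n_domain, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.shared = rand_matrix(dim, dim, rng)  # shared path, never routed
        self.domain = [rand_matrix(dim, dim, rng) for _ in range(n_domain)]
        # Assumption: the router scores experts from the local frame state
        # concatenated with a global scenario embedding.
        self.router = rand_matrix(n_domain, 2 * dim, rng)

    def route(self, h, scenario):
        probs = softmax(matvec(self.router, h + scenario))
        chosen = sorted(range(len(probs)), key=lambda i: -probs[i])[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        return {i: probs[i] / norm for i in chosen}  # renormalised weights

    def forward(self, h, scenario):
        out = matvec(self.shared, h)  # shared conversion knowledge
        for i, w in self.route(h, scenario).items():  # scenario-specific part
            y = matvec(self.domain[i], h)
            out = [o + w * v for o, v in zip(out, y)]
        return out

def gated_fusion(h, prosody, gate_w):
    """Add prosody to h through a per-dimension sigmoid gate computed from both."""
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(gate_w, h + prosody)]
    return [x + g * p for x, g, p in zip(h, gate, prosody)]

layer = DualPathMoE(dim=4, n_domain=3, top_k=1)
h = [0.1, -0.2, 0.3, 0.0]
scenario = [1.0, 0.0, 0.0, 0.0]  # hypothetical scenario embedding
weights = layer.route(h, scenario)
out = layer.forward(h, scenario)
fused = gated_fusion(h, [0.0] * 4, rand_matrix(4, 8, random.Random(1)))
```

In this sketch the shared expert sees every frame, so knowledge common to all scenarios is never competed for by the router, while the routed experts only add a correction on top. With zero prosody input the gated fusion leaves the hidden state unchanged, which is one way to read "adaptive usage of prosody information".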

If this is right

The same model delivers competitive results in linguistic-preserving speech conversion, expressive speech conversion, and singing voice conversion.
Scenario control remains flexible through the routing and prosody mechanisms.
Decoding can be reduced to as few as two steps while retaining quality.
The architecture supports high-fidelity sequence modeling without a VAE.
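To show what a "step" means in the decoding claim, here is a minimal sketch of an iterative sampler whose cost is one model call per step. This page does not say how the paper reaches 2-step decoding, so everything here is an assumption: an Euler integrator from t=0 to t=1, and a hand-written `toy_velocity` standing in for a trained network.

```python
def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler updates."""
    x, dt = list(x0), 1.0 / n_steps
    evals = 0
    for k in range(n_steps):
        t = k * dt
        v = velocity(x, t)  # one model call per step
        evals += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, evals

# Hypothetical stand-in for a trained network: a straight-line field that
# carries any point to a fixed target by t=1.
TARGET = [0.5, -1.0, 2.0]

def toy_velocity(x, t):
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]

x2, calls2 = euler_sample(toy_velocity, [0.0, 0.0, 0.0], n_steps=2)
x16, calls16 = euler_sample(toy_velocity, [0.0, 0.0, 0.0], n_steps=16)
```

The toy field is perfectly straight, so two steps land on the target as well as sixteen do. A learned field is generally curved, so the quality of a few-step variant has to be measured rather than assumed.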

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment in resource-limited settings becomes simpler because one set of weights covers multiple voice conversion use cases.
The routing design may generalize to other audio domains where shared structure coexists with distinct stylistic requirements, such as music style transfer.
Further scaling the expert count or training data could reveal whether unification remains stable when singing data grows closer to speech volume.

Load-bearing premise

The two-stage progressive training with LoRA-based domain experts can sufficiently alleviate the data imbalance between abundant speech and scarce singing data to enable high performance in all three scenarios without trade-offs.
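As background for the premise above, a LoRA adapter can be sketched in a few lines. This follows the standard LoRA formulation (frozen weight plus a scaled low-rank product, with one factor initialised to zero); how OneVoice attaches such adapters to its domain experts, and at what rank, is not stated on this page, so `LoRALinear` and its parameters are illustrative only.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weight from pre-training
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))  # rank-r bottleneck
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

# 8x8 identity as a stand-in for a pre-trained weight matrix.
W = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(W, rank=2)
x = [0.5, -1.0, 0.25, 0.0, 2.0, -0.5, 1.5, 0.75]
y = layer.forward(x)
```

Because `B` starts at zero, the adapted layer reproduces the pre-trained output exactly before any fine-tuning, and only 32 adapter values are trainable here against 64 in the full matrix. That is the property that makes LoRA a plausible tool when the second-stage data (singing) is scarce.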

What would settle it

Listening tests or objective metrics showing that OneVoice underperforms a dedicated singing voice conversion model on melody or pitch accuracy when the LoRA domain experts are removed would indicate the unification claim does not hold.
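One objective metric of the kind mentioned above is the correlation between source and converted F0 contours. The sketch below assumes the two F0 tracks are already extracted and time-aligned, with unvoiced frames marked as 0; the function name and the sample values are made up for illustration.

```python
import math

def f0_correlation(f0_src, f0_conv):
    """Pearson correlation of two F0 tracks over frames voiced in both (F0 > 0)."""
    pairs = [(a, b) for a, b in zip(f0_src, f0_conv) if a > 0 and b > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two jointly voiced frames")
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("constant F0 track; correlation undefined")
    return cov / math.sqrt(vx * vy)

# Illustrative values in Hz; 0.0 marks an unvoiced frame.
source = [0.0, 220.0, 233.1, 246.9, 0.0, 261.6, 293.7]
shifted = [0.0, 440.0, 466.2, 493.8, 0.0, 523.2, 587.4]  # same contour, doubled
flat = [0.0, 300.0, 301.0, 300.0, 0.0, 301.0, 300.0]  # contour lost

r_shifted = f0_correlation(source, shifted)
r_flat = f0_correlation(source, flat)
```

Correlation is insensitive to a constant scaling of pitch, so it captures whether the melodic contour survives conversion but not whether the absolute key is right. A test of the unification claim would likely need both a contour metric like this and an absolute pitch-error measure.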

Figures

Figures reproduced from arXiv: 2601.18094 by Junlan Feng, Shilei Zhang, Tao Li, Wenshuo Ge, Zhichao Wang, Zihao Cui.

**Figure 2.** The details of the LM block. Accompanying excerpt (Section 3.1, Conditional MoE Architecture): As outlined in Section 1, diverse expressions, such as EVC and SVC, can be viewed as deriving from the fundamental LVC by augmenting linguistic content with either paralinguistic prosody or melodic contours. This perspective guides the core design of OneVoice. To jointly model shared and scenario-specific knowledge, we employ a language model integ…

**Figure 3.** The LocalDiT block. Accompanying excerpt: …rectional local transformer as the diffusion head, called LocalDiT. Given the hidden state $h_{t'}$ from the LM, LocalDiT iteratively produces the mel spectrogram within the $t'$-th patch over diffusion time $\bar{t}$, where the last historical patch is also employed as a prefix input for better generation quality. This modeling process can be formulated as: $p_\theta(a_{r\cdot t'+1:r\cdot t'+r} \mid \bar{t}, h_{t'}, a_{r\cdot(t'-1)+1}$…

Original abstract

Recent progress in voice conversion (VC) has reached a new milestone in speaker cloning and linguistic preservation. But the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification lies in a Mixture-of-Experts (MoE) designed to explicitly model shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism, including shared expert isolation and scenario-aware domain expert assignment with global-local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer via a gated mechanism, allowing adaptive usage of prosody information. Furthermore, to enable the core idea and alleviate the data imbalance (abundant speech vs. scarce singing), we adopt a two-stage progressive training that includes foundational pre-training and scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while verifying flexible control over scenarios and offering a fast decoding version with as few as 2 steps. Audio samples are available on the demo page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents OneVoice, a unified zero-shot voice conversion framework that handles linguistic-preserving, expressive, and singing scenarios in a single model. It builds on a continuous language model trained with VAE-free next-patch diffusion, employs a Mixture-of-Experts architecture with dual-path routing (shared expert isolation plus scenario-aware domain expert assignment using global-local cues), fuses scenario-specific prosodic features via a gated mechanism, and uses two-stage progressive training (foundational pre-training followed by LoRA-based domain-expert enhancement) to address abundant-speech versus scarce-singing data imbalance. The central empirical claim is that OneVoice matches or surpasses specialized models across all three scenarios while enabling flexible scenario control and a fast 2-step decoding variant.

Significance. If the performance claims and absence of trade-offs are substantiated, the work could consolidate three previously fragmented VC subfields into one deployable model, reducing the need for scenario-specific systems and offering practical gains in flexibility and inference speed. The explicit separation of shared conversion knowledge from scenario-specific expressivity via MoE and the progressive LoRA strategy constitute a clear methodological contribution, provided they are backed by ablations and reproducible metrics.

major comments (2)

[Training Procedure and Experimental Results] The unification claim without performance trade-offs rests on the two-stage progressive training successfully injecting singing-specific expressivity while preserving speech metrics. The manuscript provides no data-volume ratios between speech and singing corpora, no LoRA rank or adapter configuration details, and no ablation tables comparing speech-only metrics before versus after the LoRA singing-enhancement stage. Without these, it is impossible to verify that the “no trade-offs” assertion holds.
[Experiments] Table or figure reporting cross-scenario comparisons: the claim that OneVoice matches or surpasses specialized models requires explicit numerical results (e.g., MOS, WER, speaker similarity, F0 correlation) against named baselines for each of the three scenarios, together with statistical significance tests. The current presentation leaves the magnitude of any gains or equivalences unclear.

minor comments (2)

[Method] Notation for the dual-path routing and gated prosody fusion should be introduced with a single diagram or equation block early in the method section to improve readability.
[Experiments] The fast-decoding (2-step) variant is mentioned only briefly; a short paragraph or table quantifying the quality-speed trade-off relative to the full model would strengthen the efficiency claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below with clarifications and commit to revisions that will incorporate the requested details and tables to strengthen the manuscript.

Referee: [Training Procedure and Experimental Results] The unification claim without performance trade-offs rests on the two-stage progressive training successfully injecting singing-specific expressivity while preserving speech metrics. The manuscript provides no data-volume ratios between speech and singing corpora, no LoRA rank or adapter configuration details, and no ablation tables comparing speech-only metrics before versus after the LoRA singing-enhancement stage. Without these, it is impossible to verify that the “no trade-offs” assertion holds.

Authors: We agree that these specifics are necessary to substantiate the no-trade-offs claim. In the revised manuscript we will report the exact data-volume ratios between the speech and singing corpora. We will also document the LoRA rank and full adapter configuration in the training procedure section. In addition, we will add an ablation table that directly compares speech metrics (WER and speaker similarity) before versus after the LoRA singing-enhancement stage, thereby allowing readers to verify that speech performance is preserved. revision: yes
Referee: [Experiments] Table or figure reporting cross-scenario comparisons: the claim that OneVoice matches or surpasses specialized models requires explicit numerical results (e.g., MOS, WER, speaker similarity, F0 correlation) against named baselines for each of the three scenarios, together with statistical significance tests. The current presentation leaves the magnitude of any gains or equivalences unclear.

Authors: We acknowledge that the current presentation summarizes results without the requested level of numerical detail or statistical tests. In the revision we will insert a new table (or expanded figure) that reports explicit values for MOS, WER, speaker similarity, and F0 correlation for OneVoice against the named specialized baselines in each of the three scenarios. We will also include the results of statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) to quantify the observed equivalences or improvements. revision: yes
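The paired tests named in the rebuttal operate on per-utterance score differences between two systems evaluated on the same items. A minimal sketch of the paired t statistic follows; the scores are made up purely for illustration and the function name is hypothetical.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on per-utterance scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs the same utterances for both systems")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

# Made-up per-utterance speaker-similarity scores, for illustration only.
unified = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78]
baseline = [0.80, 0.77, 0.83, 0.81, 0.80, 0.77]
t, dof = paired_t_statistic(unified, baseline)
```

Turning `t` into a p-value requires the t distribution with `dof` degrees of freedom; in practice `scipy.stats.ttest_rel` does both steps, and `scipy.stats.wilcoxon` provides the signed-rank alternative when the differences are not plausibly normal. Pairing matters here because utterance difficulty varies far more than system quality, and an unpaired test would bury small differences in that variance.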

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of a proposed architecture rather than self-referential derivations.

The paper describes a model architecture (MoE with dual-path routing, gated prosody fusion) and a two-stage training procedure (foundational pre-training followed by LoRA-based domain experts) to unify three VC scenarios. All central claims are supported by experimental comparisons showing performance matching or exceeding specialized models, with audio samples provided for external verification. No equations, first-principles derivations, or predictions that reduce to fitted parameters by construction appear in the provided text. The data-imbalance alleviation via progressive training is presented as a design choice whose effectiveness is tested empirically, not defined into existence. Any self-citations (if present in the full manuscript) are not load-bearing for the unification result, as the outcome remains falsifiable through independent metrics and listening tests.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are detailed in the provided text. The MoE routing and two-stage training are presented as design choices rather than formally axiomatized elements.

pith-pipeline@v0.9.0 · 5778 in / 1114 out tokens · 35497 ms · 2026-05-22T11:32:07.701752+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Its core design for unification lies in a Mixture-of-Experts (MoE) designed to explicitly model shared conversion knowledge and scenario-specific expressivity... two-stage progressive training that includes foundational pre-training and scenario enhancement with LoRA-based domain experts."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Experiments show that OneVoice matches or surpasses specialized models across all three scenarios"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Streaming voice con- version via intermediate bottleneck features and non- streaming teacher guidance

[Chenet al., 2023 ] Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, and Yuxuan Wang. Streaming voice con- version via intermediate bottleneck features and non- streaming teacher guidance. InICASSP, pages 1–5,

work page 2023

[2] [2]

Yingmusic-svc: Real- world robust zero-shot singing voice conversion with flow- grpo and singing-specific inductive biases.Arxiv,

[Chenet al., 2025 ] Gongyu Chen, Xiaoyu Zhang, Zhen- qiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei- Qiang Zhang, and Zihao Chen. Yingmusic-svc: Real- world robust zero-shot singing voice conversion with flow- grpo and singing-specific inductive biases.Arxiv,

work page 2025

[3] [3]

Neural analysis and synthesis: Reconstructing speech from self-supervised representations

[Choiet al., 2021 ] Hyeong-Seok Choi, Juheon Lee, Wansoo Kim, Jie Lee, Hoon Heo, and Kyogu Lee. Neural analysis and synthesis: Reconstructing speech from self-supervised representations. InNeurIPS, pages 16251–16265,

work page 2021

[4] [4]

[Daiet al., 2024 ] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y . Wu, Zhenda Xie, Y . K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate ex- pert specialization in mixture-of-experts language models. Arxiv,

work page 2024

[5] [5]

The nus sung and spoken lyrics corpus: A quantitative comparison of singing and speech

[Duanet al., 2013 ] Zhiyan Duan, Haotian Fang, Bo Li, Khe Chai Sim, and Ye Wang. The nus sung and spoken lyrics corpus: A quantitative comparison of singing and speech. InAPSIPA ASC, pages 1–9,

work page 2013

[6] [6]

Moshi: a speech-text foundation model for real-time dialogue

[D´efossezet al., 2024 ] Alexandre D ´efossez, Laurent Mazar´e, Manu Orsini, Am´elie Royer, Patrick P´erez, Herv´e J´egou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. Arxiv,

work page 2024

[7] [7]

Switch transformers: scaling to trillion parame- ter models with simple and efficient sparsity.Journal of Machine Learning Research, 23,

[Feduset al., 2022 ] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: scaling to trillion parame- ter models with simple and efficient sparsity.Journal of Machine Learning Research, 23,

work page 2022

[8] [8]

Zico Kolter, and Kaiming He

[Genget al., 2025 ] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.Arxiv,

work page 2025

[9] [9]

Bigvgan: A universal neural vocoder with large-scale training.Arxiv,

[gil Leeet al., 2023 ] Sang gil Lee, Wei Ping, Boris Gins- burg, Bryan Catanzaro, and Sungroh Yoon. Bigvgan: A universal neural vocoder with large-scale training.Arxiv,

work page 2023

[10] [10]

Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation.Transactions on Audio, Speech and Language Processing, 33:4044–4054,

[Heet al., 2025 ] Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, and Zhizheng Wu. Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation.Transactions on Audio, Speech and Language Processing, 33:4044–4054,

work page 2025

[11] [11]

HuBERT: Self- supervised speech representation learning by masked pre- diction of hidden units.Transactions on Audio, Speech, and Language Processing, 29:3451–3460,

[Hsuet al., 2021 ] Wei-Ning Hsu, Benjamin Bolte, Yao- Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhut- dinov, and Abdelrahman Mohamed. HuBERT: Self- supervised speech representation learning by masked pre- diction of hidden units.Transactions on Audio, Speech, and Language Processing, 29:3451–3460,

work page 2021

[12] [12]

LoRA: Low-rank adaptation of large language models

[Huet al., 2022 ] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR,

[Huang et al., 2021] Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. Multi-singer: Fast multi-singer singing voice vocoder with a large-scale corpus. In ACM MM, pages 3945–3954, 2021.

[Huang et al., 2023] Wen-Chin Huang, Lester Phillip Violeta, Songxiang Liu, Jiatong Shi, and Tomoki Toda. The singing voice conversion challenge

[Jia et al., 2025] Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, and Yuxuan Wang. DiTAR: Diffusion transformer autoregressive modeling for speech generation. In ICML, 2025.

[Jiang et al., 2025] Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Zhonghua Fu, and Lei Xie. Ref-vc: Robust, expressive and fast zero-shot voice conversion with diffusion transformers. Arxiv, 2025.

[Ju et al., 2024] Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, and Sheng Zhao. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. In ICML, 2024.

[Li et al., 2025] Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, and Yanmin Qian. Efficient multilingual asr finetuning via lora language experts. In Interspeech, pages 1138–1142, 2025.

[Lipman et al., 2023] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.

[Liu et al., 2020] Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan Su, Dong Yu, and Helen Meng. Transferring source style in non-parallel voice conversion. In Interspeech, pages 4721–4725, 2020.

[Liu et al., 2022] Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, and Zhou Zhao. Learning the beauty in songs: Neural singing voice beautifier. In ACL, pages 7970–7983, 2022.

[Liu, 2024] Songting Liu. Zero-shot voice conversion with diffusion transformers. Arxiv, 2024.

[Mu et al., 2025] Bingshen Mu, Kun Wei, Qijie Shao, Yong Xu, and Lei Xie. Hdmole: Mixture of lora experts with hierarchical routing and dynamic thresholds for fine-tuning llm-based asr models. In ICASSP, pages 1–5, 2025.

[Peebles and Xie, 2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. Arxiv, 2023.

[Peng et al., 2025] Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, and Furu Wei. Vibevoice technical report. Arxiv, 2025.

[Shazeer et al., 2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.

[Shi et al., 2024] Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, and Shinji Watanabe. Singing voice data scaling-up: An introduction to ace-opencpop and ace-kising. Arxiv, 2024.

[Sun et al., 2016] Lifa Sun, K. Li, Hao Wang, Shiyin Kang, and H. Meng. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In ICME, pages 1–6, 2016.

[Sun et al., 2024] Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Multimodal latent language modeling with next-token diffusion. Arxiv, 2024.

[Wang et al., 2022] Yu Wang, Xinsheng Wang, Pengcheng Zhu, Jie Wu, Hanzhao Li, Heyang Xue, Yongmao Zhang, Lei Xie, and Mengxiao Bi. Opencpop: A high-quality open source chinese popular song corpus for singing voice synthesis. Arxiv, 2022.

[Wang et al., 2025] Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, and Zhizheng Wu. Metis: A foundation speech generation model with masked generative pre-training. Arxiv, 2025.

[Xue et al., 2025] Heyang Xue, Xuchen Song, Yu Tang, Jianyu Chen, Yanru Chen, Yang Li, and Yahui Zhou. Moe-tts: Enhancing out-of-domain text understanding for description-based tts via mixture-of-experts. Arxiv, 2025.

[Yang et al., 2023] Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, and Helen Meng. Uniaudio: An audio foundation model toward universal audio generation. In ICML, 2023.

[Ye et al., 2025] Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, and Wei Xue. Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis. Arxiv, 2025.

[Yu et al., 2023] Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis. Megabyte: Predicting million-byte sequences with multiscale transformers. In NeurIPS, volume 36, pages 78808–78823, 2023.

[Yuguang et al., 2025] Yang Yuguang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, and Jianjun Zhao. Takin-VC: Expressive zero-shot voice conversion via adaptive hybrid content encoding and enhanced timbre modeling. In ACL, pages 1731–1742, 2025.

[Zeghidour et al., 2021] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. SoundStream: An end-to-end neural audio codec. Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.

[Zhang et al., 2022] Lichao Zhang, Ruiqi Li, Shoutong Wang, Liqun Deng, Jinglin Liu, Yi Ren, Jinzheng He, Rongjie Huang, Jieming Zhu, Xiao Chen, and Zhou Zhao. M4singer: A multi-style, multi-singer and musical score provided mandarin singing corpus. In NeurIPS, volume 35, pages 6914–6926, 2022.

[Zhou et al., 2025] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. In ICLR, 2025.
