pith. machine review for the scientific record.

arxiv: 2603.24589 · v2 · submitted 2026-03-25 · 📡 eess.AS · cs.SD


YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance


Pith reviewed 2026-05-15 00:18 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords singing voice synthesis · diffusion models · lyric manipulation · melody preservation · controllable synthesis · policy optimization · curriculum learning · audio generation

The pith

A diffusion model generates singing voices from melody clips and new lyrics without any manual alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents YingMusic-Singer-Plus, a fully diffusion-based system for singing voice synthesis that accepts an optional timbre reference, a melody clip, and modified lyrics. It relies on curriculum learning plus Group Relative Policy Optimization to keep the melody intact and match the new lyrics. The approach reports stronger melody preservation and lyric adherence than the baseline Vevo2 on unaligned inputs. The authors also release LyricEditBench as the first dedicated evaluation set for this task.

Core claim

YingMusic-Singer-Plus is a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes an optional timbre reference, a melody-providing singing clip, and modified lyrics without manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, it achieves stronger melody preservation and lyric adherence than Vevo2 while introducing LyricEditBench for standardized evaluation.
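The review names Group Relative Policy Optimization without unpacking it. GRPO's defining move, normalizing each rollout's reward against its own sampling group instead of a learned value baseline, can be sketched in a few lines. This is a generic illustration, not the paper's implementation; the reward values and function names are invented:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: score each rollout against the mean and
    standard deviation of its own sampling group (GRPO's core idea)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in rewards]

# One group of generations scored by, e.g., melody-similarity
# and lyric-adherence rewards (values illustrative):
advantages = grpo_advantages([0.9, 0.7, 0.8, 0.6])
```

Rollouts that beat their group mean receive positive advantage and are reinforced; no learned critic is needed, which is what makes the scheme attractive when rewards come from external scorers such as melody or lyric judges.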

What carries the argument

Diffusion-based architecture that ingests a melody clip and new lyrics without alignment, steered by curriculum learning and Group Relative Policy Optimization to enforce melody and lyric fidelity.

Load-bearing premise

Curriculum learning combined with Group Relative Policy Optimization can produce reliable melody preservation and lyric adherence directly from unaligned melody clips and new lyrics.

What would settle it

A disconfirming result: on the LyricEditBench test set, the model shows melody similarity scores equal to or lower than Vevo2's when lyrics are substantially altered.
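The paper's exact melody-similarity metric is not given in this summary; a common proxy in singing-voice evaluation is the correlation between the F0 (pitch) contours of the melody reference and the regenerated vocal. A minimal sketch under that assumption, with illustrative names and values:

```python
import math

def f0_correlation(f0_ref, f0_gen):
    """Pearson correlation between two equal-length F0 contours (Hz).
    Values near 1.0 mean the regenerated vocal tracks the reference melody."""
    n = len(f0_ref)
    mean_ref = sum(f0_ref) / n
    mean_gen = sum(f0_gen) / n
    cov = sum((a - mean_ref) * (b - mean_gen) for a, b in zip(f0_ref, f0_gen))
    var_ref = sum((a - mean_ref) ** 2 for a in f0_ref)
    var_gen = sum((b - mean_gen) ** 2 for b in f0_gen)
    return cov / math.sqrt(var_ref * var_gen)

# A perfectly regenerated contour scores ~1.0 (frequencies in Hz):
reference = [220.0, 246.9, 261.6, 293.7]
score = f0_correlation(reference, reference)
```

A real evaluation would extract F0 with a tracker such as RMVPE (which the paper cites) and handle unvoiced frames; correlation is only one candidate for the similarity score this falsifier refers to.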

Figures

Figures reproduced from arXiv: 2603.24589 by Chunbo Hao, Gongyu Chen, Guobin Ma, Huakang Chen, Junjie Zheng, Lei Xie, Wenjie Tian, Yuepeng Jiang, Zihao Chen.

Figure 1: Overall architecture of YingMusic-Singer-Plus. Left: the training pipeline, consisting of a Variational Autoencoder, a Melody Extractor, an IPA Tokenizer, and DiT-based conditional flow matching. Right: the GRPO training pipeline. (The caption text also notes that the VAE encoder E downsamples audio by a factor of 2048 into z = E(x) ∈ R^{T'×D}, that the decoder D reconstructs high-fidelity audio x̂ = D(z) at inference, and that the Melody Extractor is built upon the encoder of a pretrained model.)
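The caption excerpt above mentions the VAE encoder downsampling audio by a factor of 2048 into latents z = E(x) ∈ R^{T'×D}. A quick sanity check of the latent length T' this implies at the paper's stated 44.1 kHz sample rate (ceil division is an assumption about how the encoder pads a partial final frame; the clip length is illustrative):

```python
def latent_length(num_samples, downsample=2048):
    """Latent frames T' for an encoder that downsamples by `downsample`
    (ceiling division, assuming the encoder pads the final partial frame)."""
    return -(-num_samples // downsample)

# A 10-second clip at 44.1 kHz:
frames = latent_length(10 * 44_100)  # 441000 samples -> 216 latent frames
```

At roughly 21.5 latent frames per second, a 3-minute song spans about 3,900 frames, which is the order of sequence length the DiT backbone would attend over.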
Original abstract

Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer-Plus, a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs: an optional timbre reference, a melody-providing singing clip, and modified lyrics, without manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer-Plus achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline supporting melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation. The code, weights, benchmark, and demos are publicly available at https://github.com/ASLP-lab/YingMusic-Singer-Plus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces YingMusic-Singer-Plus, a diffusion-based model for controllable singing voice synthesis that accepts an optional timbre reference, a melody-providing clip, and modified lyrics without requiring manual alignment. It is trained using curriculum learning and Group Relative Policy Optimization (GRPO) and claims stronger melody preservation and lyric adherence than the Vevo2 baseline. The work also presents LyricEditBench, the first benchmark for melody-preserving lyric modification, with public release of code, weights, benchmark, and demos.

Significance. If the performance claims are substantiated with quantitative evidence, the approach would offer a practical advance in annotation-free melody-controllable singing synthesis, removing a key barrier in lyric editing applications. The public benchmark and model release would further support reproducibility and follow-on work in the field.

major comments (2)
  1. [Abstract] Abstract and experimental sections: the central claim of stronger melody preservation and lyric adherence versus Vevo2 is asserted without any quantitative metrics, ablation studies, or details on the evaluation protocol, rendering the contribution of curriculum learning and GRPO unverifiable from the provided description.
  2. [Method] Method and experiments: no isolating ablations compare curriculum learning + GRPO against standard supervised training or alternative RL objectives on the same backbone and data, leaving the causal link between these techniques and alignment-free control insecure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested quantitative details and ablations.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental sections: the central claim of stronger melody preservation and lyric adherence versus Vevo2 is asserted without any quantitative metrics, ablation studies, or details on the evaluation protocol, rendering the contribution of curriculum learning and GRPO unverifiable from the provided description.

    Authors: We acknowledge that the abstract summarizes the performance claims at a high level without numerical values. The experimental sections contain comparative results, but to fully substantiate the claims and make the contributions of curriculum learning and GRPO verifiable, we will expand the revised manuscript with specific quantitative metrics (melody preservation and lyric adherence scores versus Vevo2), a detailed evaluation protocol, and supporting ablation results. revision: yes

  2. Referee: [Method] Method and experiments: no isolating ablations compare curriculum learning + GRPO against standard supervised training or alternative RL objectives on the same backbone and data, leaving the causal link between these techniques and alignment-free control insecure.

    Authors: We agree that isolating ablations are necessary to establish causality. In the revised manuscript we will add controlled ablation experiments that train the identical backbone and data with (i) standard supervised training, (ii) curriculum learning alone, (iii) GRPO with alternative RL objectives, and (iv) the full curriculum + GRPO combination, reporting melody and lyric metrics for each to isolate the effect on alignment-free control. revision: yes
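The four arms above can be pinned down as a small configuration grid; the field names and the choice of PPO as the "alternative RL objective" are illustrative placeholders, not details from the paper:

```python
# Ablation arms sharing one backbone and dataset, per the rebuttal's plan.
ABLATION_ARMS = [
    {"name": "supervised_only", "curriculum": False, "rl": None},    # (i)
    {"name": "curriculum_only", "curriculum": True,  "rl": None},    # (ii)
    {"name": "alternative_rl",  "curriculum": True,  "rl": "ppo"},   # (iii) swap objective
    {"name": "curriculum_grpo", "curriculum": True,  "rl": "grpo"},  # (iv) full method
]

# Each arm reports the same melody-preservation and lyric-adherence metrics,
# so pairwise differences isolate the contribution of each component.
```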

Circularity Check

0 steps flagged

No significant circularity; the claims rest on an external baseline comparison and a new benchmark.

full rationale

The paper presents a diffusion-based model trained with curriculum learning and Group Relative Policy Optimization, claiming superior melody preservation and lyric adherence versus the external baseline Vevo2 on the newly introduced LyricEditBench. No equations, fitted parameters renamed as predictions, or self-citations appear as load-bearing in the derivation. Performance is evaluated against an independent baseline and public benchmark, making the central claims externally falsifiable rather than self-referential by construction. The training techniques are described as applied methods without reduction to the target metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard diffusion modeling and optimization techniques from the prior literature.

pith-pipeline@v0.9.0 · 5472 in / 1018 out tokens · 66927 ms · 2026-05-15T00:18:00.811539+00:00 · methodology


Reference graph

Works this paper leans on

35 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1]

    Introduction: Singing Voice Synthesis (SVS) aims to generate human-like singing voices from musical scores, lyrics, and timbre references. Modern systems [1, 2, 3, 4, 5, 6] achieve high-fidelity synthesis, yet most rely on precisely annotated paired data associating each phoneme with an exact pitch contour and duration. While such fine-grained contr...

  2. [2]

    Methodology: 2.1. Architecture Overview. As shown in Figure 1, YingMusic-Singer-Plus generates singing voices at 44.1 kHz from three inputs: an optional timbre reference, a melody-providing singing clip, and corresponding modified lyrics. It comprises: (1) a Variational Autoencoder (VAE) following Stable Audio 2 [13], whose encoder E downsamples a stereo 4...

  3. [3]

    Experimental Setup: Dataset. The Chinese and English subsets of Emilia [20] are used for TTS pretraining. For Singing Voice SFT, internally licensed music tracks are processed by SongFormer [21] to segment structural boundaries and label function categories, discarding non-vocal segments. Vocal stems are then isolated using Mel-band RoFormer [22]. W...

  4. [4]

    Experimental Results: 4.1. Main Results. We compare against Vevo2 [9], a token-based autoregressive model with disentangled timbre and melody control, where the timbre and melody references share the same clip for singing voice editing, or use separate clips for melody control. Vevo2 is the most direct baseline, as other systems operate under fundamentall...

  5. [5]

    Conclusion: We present YingMusic-Singer-Plus, a melody-controllable singing voice editing model that synthesizes from a timbre reference, a melody-providing singing clip, and modified lyrics without manual alignment. Through curriculum training and GRPO-based reinforcement learning, YingMusic-Singer-Plus achieves superior melody preservation and lyric ad...

  6. [6]

    Generative AI Use Disclosure

    Generative AI tools are used solely for linguistic refinement and play no role in methodology, experimentation, interpretation, or the production of scientific results. The authors bear full intellectual responsibility for all content in this manuscript.

  7. [7]

    Diffsinger: Singing voice synthesis via shallow diffusion mechanism

    J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao, “Diffsinger: Singing voice synthesis via shallow diffusion mechanism,” in AAAI. AAAI Press, 2022, pp. 11020–11028.

  8. [8]

    Rmssinger: Realistic-music-score based singing voice synthesis

    J. He, J. Liu, Z. Ye, R. Huang, C. Cui, H. Liu, and Z. Zhao, “Rmssinger: Realistic-music-score based singing voice synthesis,” in ACL (Findings). Association for Computational Linguistics, 2023, pp. 236–248.

  9. [9]

    Stylesinger: Style transfer for out-of-domain singing voice synthesis

    Y. Zhang, R. Huang, R. Li, J. He, Y. Xia, F. Chen, X. Duan, B. Huai, and Z. Zhao, “Stylesinger: Style transfer for out-of-domain singing voice synthesis,” in AAAI. AAAI Press, 2024, pp. 19597–19605.

  10. [10]

    Expressivesinger: Synthesizing expressive singing voice as an instrument

    F. Wang, B. Bai, Y. Deng, J. Xue, Y. Gao, and Y. Li, “Expressivesinger: Synthesizing expressive singing voice as an instrument,” in ISCSLP. IEEE, 2024, pp. 304–308.

  11. [11]

    Visinger2+: End-to-end singing voice synthesis augmented by self-supervised learning representation

    Y. Yu, J. Shi, Y. Wu, Y. Tang, and S. Watanabe, “Visinger2+: End-to-end singing voice synthesis augmented by self-supervised learning representation,” in SLT. IEEE, 2024, pp. 719–726.

  12. [12]

    Tcsinger 2: Customizable multilingual zero-shot singing voice synthesis

    Y. Zhang, W. Guo, C. Pan, D. Yao, Z. Zhu, Z. Jiang, Y. Wang, T. Jin, and Z. Zhao, “Tcsinger 2: Customizable multilingual zero-shot singing voice synthesis,” in ACL (Findings). Association for Computational Linguistics, 2025, pp. 13280–13294.

  13. [13]

    Songcreator: Lyrics-based universal song generation

    S. Lei, Y. Zhou, B. Tang, M. W. Y. Lam, F. Liu, H. Liu, J. Wu, S. Kang, Z. Wu, and H. Meng, “Songcreator: Lyrics-based universal song generation,” in NeurIPS, 2024.

  14. [14]

    Songeditor: Adapting zero-shot song generation language model as a multi-task editor

    C. Yang, S. Wang, H. Chen, J. Yu, W. Tan, R. Gu, Y. Xu, Y. Zhou, H. Zhu, and H. Li, “Songeditor: Adapting zero-shot song generation language model as a multi-task editor,” in AAAI. AAAI Press, 2025, pp. 25597–25605.

  15. [15]

    Vevo2: A unified and controllable framework for speech and singing voice generation

    X. Zhang, J. Zhang, Y. Wang, C. Wang, Y. Chen, D. Jia, Z. Chen, and Z. Wu, “Vevo2: A unified and controllable framework for speech and singing voice generation,” CoRR, vol. abs/2508.16332, 2025.

  16. [16]

    Soulx-singer: Towards high-quality zero-shot singing voice synthesis

    J. Qian, H. Meng, T. Zheng, P. Zhu, H. Lin, Y. Dai, H. Xie, W. Cao, R. Shang, J. Wu, H. Liu, H. Wen, J. Zhao, Z. Jiang, Y. Chen, S. Yin, M. Tao, J. Wei, L. Xie, and X. Wang, “Soulx-singer: Towards high-quality zero-shot singing voice synthesis,” CoRR, vol. abs/2602.07803, 2026.

  17. [17]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    D. Guo, D. Yang, H. Zhang, J. Song et al., “Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,” Nat., vol. 645, no. 8081, pp. 633–638, 2025.

  18. [18]

    Gtsinger: A global multi-technique singing corpus with realistic music scores for all singing tasks

    Y. Zhang, C. Pan, W. Guo, R. Li, Z. Zhu, J. Wang, W. Xu, J. Lu, Z. Hong, C. Wang, L. Zhang, J. He, Z. Jiang, Y. Chen, C. Yang, J. Zhou, X. Cheng, and Z. Zhao, “Gtsinger: A global multi-technique singing corpus with realistic music scores for all singing tasks,” in NeurIPS, 2024.

  19. [19]

    Long-form music generation with latent diffusion

    Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Long-form music generation with latent diffusion,” in ISMIR, 2024, pp. 429–437.

  20. [20]

    Diffrhythm: Blazingly fast and embarrassingly simple end-to-end full-length song generation with latent diffusion

    Z. Ning, H. Chen, Y. Jiang, C. Hao, G. Ma, S. Wang, J. Yao, and L. Xie, “Diffrhythm: Blazingly fast and embarrassingly simple end-to-end full-length song generation with latent diffusion,” CoRR, vol. abs/2503.01183, 2025.

  21. [21]

    F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in ACL (1). Association for Computational Linguistics, 2025, pp. 6255–6271.

  22. [22]

    Flow-GRPO: Training Flow Matching Models via Online RL

    J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-GRPO: Training flow matching models via online RL,” CoRR, vol. abs/2505.05470, 2025.

  23. [23]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    J. Li, Y. Cui, T. Huang, Y. Ma, C. Fan, M. Yang, and Z. Zhong, “MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE,” CoRR, vol. abs/2507.21802, 2025.

  24. [24]

    Flowse-grpo: Training flow matching speech enhancement via online reinforcement learning

    H. Wang, B. Tian, Y. Jiang, Z. Pan, S. Zhao, B. Ma, D. Chen, and X. Li, “Flowse-grpo: Training flow matching speech enhancement via online reinforcement learning,” CoRR, vol. abs/2601.16483, 2026.

  25. [25]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI, “DeepSeek-V3.2: Pushing the frontier of open large language models,” CoRR, vol. abs/2512.02556, 2025.

  26. [26]

    Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

    H. He, Z. Shang, C. Wang, X. Li, Y. Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shi, Y. Wang, K. Chen, P. Zhang, and Z. Wu, “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,” in SLT. IEEE, 2024, pp. 885–890.

  27. [27]

    SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

    C. Hao, R. Yuan, J. Yao, Q. Deng, X. Bai, W. Xue, and L. Xie, “SongFormer: Scaling music structure analysis with heterogeneous supervision,” CoRR, vol. abs/2510.02797, 2025.

  28. [28]

    Mel-band roformer for music source separation

    J. Wang, W. T. Lu, and M. Won, “Mel-band roformer for music source separation,” CoRR, vol. abs/2310.01809, 2023.

  29. [29]

    Powerset multi-class cross entropy loss for neural speaker diarization

    A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” in INTERSPEECH. ISCA, 2023, pp. 3222–3226.

  30. [30]

    pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe

    H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,” in INTERSPEECH. ISCA, 2023, pp. 1983–1987.

  31. [31]

    Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

    C. K. A. Reddy, V. Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP. IEEE, 2021, pp. 6493–6497.

  32. [32]

    Dnsmos P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

    C. K. A. Reddy, V. Gopal, and R. Cutler, “Dnsmos P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP. IEEE, 2022, pp. 886–890.

  33. [33]

    Qwen3-ASR Technical Report

    X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-ASR technical report,” CoRR, vol. abs/2601.21337, 2026.

  34. [34]

    RMVPE: A robust model for vocal pitch estimation in polyphonic music

    H. Wei, X. Cao, T. Dan, and Y. Chen, “RMVPE: A robust model for vocal pitch estimation in polyphonic music,” in INTERSPEECH. ISCA, 2023, pp. 5421–5425.

  35. [35]

    Singing timbre popularity assessment based on multimodal large foundation model

    Z. Wang, R. Yuan, Z. Geng, H. Li, X. Qu, X. Li, S. Chen, H. Fu, R. B. Dannenberg, and K. Zhang, “Singing timbre popularity assessment based on multimodal large foundation model,” in ACM Multimedia. ACM, 2025, pp. 12227–12236.