CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

Hui Wang; Jiaming Zhou; Junyang Chen; Yongchang Gan; Yong Qin; Yuhang Jia

arxiv: 2605.25930 · v2 · pith:5P2DDVRUnew · submitted 2026-05-25 · 💻 cs.SD

Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yongchang Gan, Yong Qin

Pith reviewed 2026-06-29 20:25 UTC · model grok-4.3

classification 💻 cs.SD

keywords speech editing · zero-shot TTS · reinforcement learning · GRPO · post-training · local acoustic consistency · generative speech models

The pith

Reinforcement learning tuned for speech editing also improves zero-shot TTS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CosyEdit2, a two-stage post-training method that begins with supervised fine-tuning on speech editing and then applies editing-oriented Group Relative Policy Optimization using data without target speech. This setup targets the bottlenecks of imperfect paired editing data and coarse optimization signals that limit prior supervised approaches. Experiments show gains in both speech editing with stricter local acoustic consistency and zero-shot TTS performance, pointing to a mutual relationship between the tasks. A sympathetic reader would care because the method suggests ways to strengthen generative speech models by leveraging unpaired data for consistency training.

Core claim

CosyEdit2 advances speech editing by progressing from supervised editing initialization to editing-oriented Group Relative Policy Optimization over target-speech-free data. This not only substantially advances speech editing performance but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks that share a generative foundation conditioned on speech prompts.

What carries the argument

Editing-oriented Group Relative Policy Optimization (GRPO) applied after supervised fine-tuning, which supplies optimization signals for local acoustic consistency using target-speech-free data.
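As a rough illustration of the "group relative" part of GRPO: in its common formulation, several candidates are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group. The sketch below shows only that normalization step. The reward values are hypothetical placeholders, and nothing here reflects the paper's actual editing-specific rewards, which this page does not specify.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so candidates
    are compared only against their siblings, not against a learned critic.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for four candidate edits of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Candidates that score above their group's mean receive a positive advantage and are reinforced; those below are discouraged. When every candidate in a group scores the same, all advantages are zero and the group contributes no learning signal.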

If this is right

Speech editing achieves stricter local acoustic consistency with surrounding unedited content than SFT alone provides.
Zero-shot TTS performance improves through the editing-oriented training signals.
The mutual relationship between speech editing and zero-shot TTS strengthens via shared conditioning on speech prompts.
Post-training can advance beyond SFT bottlenecks by using GRPO on unpaired data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines that combine editing and synthesis objectives could further exploit their mutual benefits in other conditional generative settings.
Target-speech-free data strategies might reduce reliance on scarce paired datasets for related audio generation tasks.
Local consistency enforcement via reinforcement learning could extend to other sequential editing problems such as music or video.

Load-bearing premise

That editing-oriented GRPO on target-speech-free data supplies optimization signals strong enough to overcome the limitations of imperfect paired editing data without introducing new inconsistencies in local acoustics.

What would settle it

A controlled comparison where the GRPO stage is added to the SFT baseline and zero-shot TTS metrics or local consistency scores in edited regions fail to improve or degrade relative to the SFT-only model.
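One way such a controlled comparison could be scored is a paired bootstrap over per-utterance metric differences between the SFT-only model and the SFT-plus-GRPO model. The sketch below is a generic statistical recipe, not the paper's evaluation protocol; the function name, the choice of metric, and the numbers in the usage lines are all hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline: List[float],
    candidate: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI lower bound, CI upper bound).

    Differences are candidate - baseline on the same utterances, so for an
    error-type metric a negative value means the candidate is better.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Hypothetical per-utterance error rates for the same five test utterances.
sft_only = [0.10, 0.12, 0.08, 0.15, 0.11]
sft_plus_grpo = [0.08, 0.09, 0.07, 0.11, 0.10]
mean_diff, lower, upper = paired_bootstrap_ci(sft_only, sft_plus_grpo)
print(f"mean difference {mean_diff:+.4f}, 95% CI [{lower:+.4f}, {upper:+.4f}]")
```

If the interval for the difference includes zero, or lies on the wrong side of it, the GRPO stage has not shown an improvement on that metric, which is the outcome that would count against the paper's claim.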

Figures

Figures reproduced from arXiv: 2605.25930 by Hui Wang, Jiaming Zhou, Junyang Chen, Yongchang Gan, Yong Qin, Yuhang Jia.

**Figure 2.** TTS-to-Edit Prompt Construction. … target speech tokens, which are then decoded into waveforms by the fixed Flow and BigVGAN modules. GRPO evaluates the generated speech at the waveform level with editing-specific rewards, rather than imitating a manually constructed target recording at the speech-token level. This avoids supervision artifacts from imperfect edit boundaries or mismatched acoustic condition…

**Figure 3.** An overview of the editing-oriented GRPO.

**Figure 4.** Spectrogram comparison between zero-shot TTS and speech editing. The region between the two red…

**Figure 5.** Examples of rule-based edit perturbations used in the TTS-to-edit prompt synthesis pipeline, including…

**Figure 6.** Reward functions used in the editing-oriented GRPO stage. (a) The WER reward adopts an exponential…

**Figure 7.** Mel spectrogram visualization of a speech sample reconstructed by HiFT-GAN from CosyVoice2 and our…

**Figure 8.** Speech Editing Subjective Evaluation Annotation UI.

**Figure 9.** Zero-shot TTS Subjective Evaluation Annotation UI.
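The Figure 6 caption mentions a WER reward with an exponential shape, but the caption is cut off on this page, so the exact function cannot be recovered here. Purely to illustrate what an exponential WER-based reward can look like, the sketch below computes word error rate by word-level edit distance and maps it through `exp(-decay * WER)`. Both the functional form and the `decay` value are assumptions made for this example, not the paper's definition.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Dynamic programming over one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def exponential_wer_reward(wer: float, decay: float = 5.0) -> float:
    """Hypothetical reward: 1.0 at zero WER, decaying smoothly as WER grows.

    The decay rate is an illustrative assumption, not a value from the paper.
    """
    return math.exp(-decay * max(wer, 0.0))


# Hypothetical target text and ASR transcript of the edited audio.
target_text = "the meeting starts at nine tomorrow"
asr_transcript = "the meeting starts at five tomorrow"
wer = word_error_rate(target_text, asr_transcript)
print(f"WER={wer:.3f} reward={exponential_wer_reward(wer):.3f}")
```

A smooth, bounded mapping of this kind keeps the reward in (0, 1] and penalizes the first few word errors most steeply, which is one plausible reason to prefer it over using raw WER directly.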

Original abstract

Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper describes a two-stage pipeline for speech editing, SFT followed by GRPO on target-speech-free data, and claims this also improves zero-shot TTS, but the abstract contains no metrics or baselines against which to check the gains.

The main takeaway is a two-stage post-training setup: supervised fine-tuning on paired editing data followed by Group Relative Policy Optimization using target-speech-free examples. The authors position this as a way past the limits of imperfect paired data and coarse optimization, with the added claim that the same process strengthens zero-shot TTS and shows the tasks are more connected than previously treated.

What is actually new is the concrete choice to run editing-oriented GRPO after SFT initialization on data that lacks target speech. The abstract presents this as an empirical fix rather than a derivation from prior equations, and it avoids obvious circularity by describing a training procedure instead of fitting to the desired outcome.

The motivation around stricter local acoustic consistency for editing versus general TTS is laid out plainly. That part is coherent on its own terms.

The clear limitation is that every performance claim sits in the abstract with zero numbers, no baselines, no statistical tests, and no data details. The central assertion that GRPO supplies stronger signals than SFT alone cannot be evaluated from the text provided. If the full paper has the tables and ablations, that changes the picture; right now the evidence is missing.

This is for people already working on post-training for TTS and editing models. A reader in that narrow area might pick up the pipeline idea if the experiments check out. I would send it to peer review because the argument is internally consistent and the subfield can use more work on these task connections, even though the current version needs the results section to be usable.

Referee Report

1 major / 0 minor

Summary. The paper proposes CosyEdit2, a speech-editing model that employs a two-stage post-training pipeline: supervised fine-tuning (SFT) initialization on paired editing data, followed by editing-oriented Group Relative Policy Optimization (GRPO) on target-speech-free unpaired data. It claims this framework substantially advances speech-editing performance while also improving zero-shot TTS capability, thereby revealing a deeper mutual relationship between the two tasks.

Significance. If the empirical claims hold, the work would provide evidence that RL-based optimization on unpaired data can overcome limitations of imperfect paired data in speech editing and yield transferable gains to zero-shot TTS, supporting a bidirectional training relationship between the tasks.

major comments (1)

[Abstract] The central claim that 'extensive experiments demonstrate' substantial advances in speech editing and unlocked zero-shot TTS performance is unsupported by any reported metrics, baselines, statistical tests, or dataset details. Without these, the empirical foundation of the mutual-relationship conclusion cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment below.

Referee: [Abstract] The central claim that 'extensive experiments demonstrate' substantial advances in speech editing and unlocked zero-shot TTS performance is unsupported by any reported metrics, baselines, statistical tests, or dataset details. Without these, the empirical foundation of the mutual-relationship conclusion cannot be assessed.

Authors: The abstract is a high-level summary and does not contain numerical results, as is conventional due to length limits. The full manuscript reports the supporting experiments, including quantitative metrics on editing consistency and zero-shot TTS, baseline comparisons, and dataset details in the Experiments section. To directly address the concern, we will revise the abstract to include one or two key quantitative highlights (e.g., relative gains on standard metrics).

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The provided abstract and description outline a standard empirical two-stage pipeline (SFT initialization followed by GRPO on target-speech-free data) whose central claim is an observed empirical improvement in both editing and zero-shot TTS. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the text. The derivation chain is self-contained as an experimental training procedure without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on standard assumptions of reinforcement learning (policy optimization improves local consistency) and the existence of usable target-speech-free data; no free parameters, invented entities, or ad-hoc axioms are stated.

axioms (1)

Domain assumption: RL policy optimization on group-relative rewards yields finer acoustic consistency than SFT on paired data.
Implicit in the claim that GRPO overcomes SFT bottlenecks

pith-pipeline@v0.9.1-grok · 5700 in / 1137 out tokens · 21565 ms · 2026-06-29T20:25:01.674909+00:00 · methodology
