CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models

Hui Wang; Jiaming Zhou; Junyang Chen; Yong Qin; Yuhang Jia

arxiv: 2601.05329 · v2 · pith:ABTRISKVnew · submitted 2026-01-08 · 💻 cs.SD · eess.AS


Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yong Qin


keywords: editing, speech, model, cosyedit, end-to-end, alignment, cascade, only


Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems rely on explicit temporal alignment and complex preprocessing. To address these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific post-training and a complementary training paradigm, which internalizes text-speech alignment while ensuring high consistency between the speech before and after editing. Trained on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Extensive evaluations show that CosyEdit not only outperforms several billion-parameter language model baselines but also approaches state-of-the-art cascade systems. These results show that robust and efficient speech editing can be unlocked from a zero-shot TTS model through post-training, offering a cost-effective end-to-end solution for high-quality speech editing. Code and audio samples are available at https://cjy1018.github.io/CosyEditDemoPage/.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
eess.AS 2026-06 unverdicted novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show ...