CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models

2026 · cs.SD · arXiv 2601.05329

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems rely on explicit temporal alignment and complex preprocessing. To address these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific post-training and a complementary training paradigm, which internalizes text-speech alignment while ensuring high consistency between the speech before and after editing. Trained on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Extensive evaluations show that CosyEdit not only outperforms several billion-parameter language model baselines but also approaches state-of-the-art cascade systems. These results show that robust and efficient speech editing can be unlocked from a zero-shot TTS model through post-training, offering a cost-effective end-to-end solution for high-quality speech editing. Code and audio samples are available at https://cjy1018.github.io/CosyEditDemoPage/.

representative citing papers

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics. Evaluations show that no model excels across dimensions and that compositional editing is especially difficult.

citing papers explorer

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · none · ref 23 · internal anchor
