arxiv: 2602.00560 · v2 · pith:5TYWZ3S7new · submitted 2026-01-31 · 💻 cs.SD · eess.AS

Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards

classification 💻 cs.SD eess.AS
keywords content · editing · acoustic · acoustics · edit · imperceptible · intelligibility · perceptual

Imperceptible text-based speech editing modifies spoken content through transcript manipulation while preserving acoustic continuity. Prior acoustic-space approaches suffer from content-style entanglement, causing unstable generation and boundary artifacts. We introduce a framework guided by the principle of "Edit Content, Preserve Acoustics". Editing is conducted in a stable semantic space, while acoustic realization is handled by a Flow Matching decoder. To ensure perceptual consistency, we propose Self-Consistency Rewards Group Relative Policy Optimization, which leverages a pre-trained Text-to-Speech model as an implicit critic, together with intelligibility and duration constraints. Experiments demonstrate consistent improvements over state-of-the-art autoregressive and non-autoregressive baselines in intelligibility, robustness, and perceptual quality.
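
The abstract does not spell out the reward formulation, so the following is only a minimal sketch of how a group-relative policy optimization step might score a group of candidate edits. The normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO advantage. Everything else is an assumption made for illustration: the `combined_reward` function, its weights, the use of `1 - WER` as the intelligibility term, and the duration ratio penalty are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(critic_score, wer, duration_ratio,
                    w_critic=1.0, w_wer=1.0, w_dur=1.0):
    """Hypothetical scalar reward for one candidate edit.

    critic_score   -- score from a pre-trained TTS model acting as critic
                      (assumed: higher means more self-consistent)
    wer            -- word error rate of the edited speech, clipped to [0, 1]
    duration_ratio -- generated duration / expected duration (1.0 is ideal)
    The weights are illustrative assumptions, not values from the paper.
    """
    intelligibility = 1.0 - min(max(wer, 0.0), 1.0)
    duration_penalty = abs(duration_ratio - 1.0)
    return (w_critic * critic_score
            + w_wer * intelligibility
            - w_dur * duration_penalty)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: normalize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three candidate edits sampled for the same prompt (made-up numbers).
rewards = [
    combined_reward(critic_score=-0.2, wer=0.00, duration_ratio=1.00),
    combined_reward(critic_score=-0.5, wer=0.10, duration_ratio=1.20),
    combined_reward(critic_score=-0.9, wer=0.30, duration_ratio=0.70),
]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, candidates are ranked only against other samples for the same edit, which is what lets a critic-free setup like this work without a learned value function.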

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

    cs.SD · 2026-05 · unverdicted · novelty 6.0

    A two-stage post-training pipeline of SFT followed by editing-oriented GRPO on unpaired data improves speech editing consistency and zero-shot TTS quality.