DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

Haoran Zhang; Jingbo Zhu; Junxiang Zhang; Tong Xiao; Xiaoqian Liu; Yuan Ge; Zhengkun Ge; Zhengtao Yu

arxiv: 2606.07356 · v1 · pith:RBPXQ2Q4new · submitted 2026-06-05 · 💻 cs.SD · cs.CL

Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao

keywords: editing, audio, directaudioedit, inversion-free, while, diffusion, text-guided, training-free

Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.
