High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching

Anurag Kumar; Bowen Shi; Brian Ellis; David Kant; Ernie Chang; Gael Le Lan; Sidd Srinivasan; Varun Nagaraja; Vikas Chandra; Wei-Ning Hsu

arxiv: 2407.03648 · v2 · pith:YUCKWI5Anew · submitted 2024-07-04 · 📡 eess.AS · cs.SD

Gael Le Lan; Bowen Shi; Zhaoheng Ni; Sidd Srinivasan; Anurag Kumar; Brian Ellis; David Kant; Varun Nagaraja; Ernie Chang; Wei-Ning Hsu; Yangyang Shi; Vikas Chandra

classification 📡 eess.AS cs.SD

keywords editing · model · music · inversion · latent · available · ddim · diffusion

We introduce MelodyFlow, an efficient text-controllable high-fidelity music generation and editing model. It operates on continuous latent representations from a low-frame-rate 48 kHz stereo variational autoencoder codec. Based on a diffusion transformer architecture trained with a flow-matching objective, the model can edit diverse, high-quality stereo samples of variable duration using simple text descriptions. We adapt the ReNoise latent inversion method to flow matching and compare it with the original implementation and with naive denoising diffusion implicit model (DDIM) inversion on a variety of music editing prompts. Our results indicate that our latent inversion outperforms both ReNoise and DDIM for zero-shot test-time text-guided editing on several objective metrics. Subjective evaluations show a substantial improvement over the previous state of the art for music editing. Code and model weights will be made publicly available. Samples are available at https://melodyflow.github.io.
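
The abstract contrasts naive inversion with a refined latent inversion. As a rough illustration of why refinement can matter, the toy sketch below integrates a hand-written velocity field with Euler steps from noise (t=0) to data (t=1), then inverts the trajectory two ways: a naive backward step that evaluates the velocity at the current point, and a fixed-point refinement that repeatedly re-evaluates the velocity at the estimated previous point. This is not the paper's algorithm or the ReNoise procedure. The velocity field, the function names (`velocity`, `generate`, `invert`) and the `refine_iters` parameter are all assumptions made for illustration, and a real model would use a trained, text-conditioned network instead.

```python
import numpy as np


def velocity(x, t, target):
    # Hypothetical toy velocity field standing in for a trained network.
    # It pulls x toward `target`, more strongly as t grows.
    return (target - x) * (1.0 + t)


def generate(x0, target, steps):
    # Forward Euler integration from t=0 (noise) to t=1 (data):
    # x_{i+1} = x_i + dt * v(x_i, t_i)
    dt = 1.0 / steps
    x = x0.copy()
    for i in range(steps):
        x = x + dt * velocity(x, i * dt, target)
    return x


def invert(x1, target, steps, refine_iters=0):
    # Walk the same time grid backwards, from t=1 to t=0.
    # The exact inverse of one forward step satisfies
    #   x_prev = x - dt * v(x_prev, t_prev),
    # which is implicit in x_prev.
    dt = 1.0 / steps
    x = x1.copy()
    for i in reversed(range(steps)):
        t_prev = i * dt
        # Naive guess: evaluate the velocity at the current point instead.
        x_prev = x - dt * velocity(x, t_prev, target)
        # Optional fixed-point refinement of the implicit equation.
        for _ in range(refine_iters):
            x_prev = x - dt * velocity(x_prev, t_prev, target)
        x = x_prev
    return x


rng = np.random.default_rng(0)
target = np.ones(4)
noise = rng.standard_normal(4)
sample = generate(noise, target, steps=10)

# Reconstruction error of the starting noise, with and without refinement.
err_naive = np.abs(invert(sample, target, 10, refine_iters=0) - noise).max()
err_fixed = np.abs(invert(sample, target, 10, refine_iters=5) - noise).max()
print("naive inversion error:      ", err_naive)
print("fixed-point inversion error:", err_fixed)
```

In this toy setting the fixed-point variant recovers the starting noise more accurately than the naive backward step, because it solves each implicit step rather than approximating it. Whether and how much this carries over to a learned model is an empirical question that the paper's evaluation addresses.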

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
cs.SD · 2026-05 · unverdicted · novelty 7.0

Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training align...

Not that Groove: Zero-Shot Symbolic Music Editing
cs.SD · 2025-05 · unverdicted · novelty 6.0

The work formalizes zero-shot symbolic drum editing as LLM reasoning over a drumroll grid notation, evaluates it on a new benchmark with automated symbolic unit tests, and reports up to 68% success across eight models.

Movie Gen: A Cast of Media Foundation Models
cs.CV · 2024-10 · unverdicted · novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.