Combining audio control and style transfer using latent diffusion

David Genova; Guillaume Doras; Nils Demerlé; Philippe Esling

arxiv: 2408.00196 · v1 · pith:P26TXRXOnew · submitted 2024-07-31 · 💻 cs.SD · cs.LG · eess.AS · stat.ML

Nils Demerlé, Philippe Esling, Guillaume Doras, David Genova

classification 💻 cs.SD cs.LG eess.AS stat.ML

keywords audio control style transfer explicit model target timbre

Deep generative models are now able to synthesize high-quality audio signals, shifting the critical aspect in their development from audio quality to control capabilities. Although text-to-music generation is getting largely adopted by the general public, explicit control and example-based style transfer are more adequate modalities to capture the intents of artists and musicians. In this paper, we aim to unify explicit control and style transfer within a single model by separating local and global information to capture musical structure and timbre respectively. To do so, we leverage the capabilities of diffusion autoencoders to extract semantic features, in order to build two representation spaces. We enforce disentanglement between those spaces using an adversarial criterion and a two-stage training strategy. Our resulting model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example. We evaluate our model on one-shot timbre transfer and MIDI-to-audio tasks on instrumental recordings and show that we outperform existing baselines in terms of audio quality and target fidelity. Furthermore, we show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems
cs.SD · 2026-05 · unverdicted · novelty 7.0

MixtureTT performs direct per-stem timbre transfer on polyphonic mixtures via a shared diffusion transformer, outperforming single-stem baselines on SATB choral data while eliminating cascaded separation errors.