Simple and Controllable Music Generation

Alexandre Défossez; David Kant; Felix Kreuk; Gabriel Synnaeve; Itai Gat; Jade Copet; Tal Remez; Yossi Adi

arxiv: 2306.05284 · v3 · pith:4KTX5K6Tnew · submitted 2023-06-08 · cs.SD · cs.AI · cs.LG · eess.AS


Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez

classification cs.SD · cs.AI · cs.LG · eess.AS

keywords music, musicgen, approach, generation, models, samples, several, studies


We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need to cascade several models, e.g., hierarchically or through upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on a textual description or melodic features, allowing better control over the generated output. We conduct an extensive empirical evaluation, considering both automatic and human studies, showing that the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
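
The abstract mentions token interleaving patterns over several parallel token streams but does not say which pattern is used. As a minimal sketch, the following assumes one possible pattern: shifting stream k right by k steps, so that a single sequence step carries tokens from different time frames and one model can predict all streams in a single pass. The function names, the `PAD` placeholder, and the choice of a per-stream delay are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Optional

# Placeholder for positions that hold no token (hypothetical convention).
PAD = None


def delay_interleave(streams: List[List[int]]) -> List[List[Optional[int]]]:
    """Shift stream k right by k steps and pad the gaps with PAD.

    `streams` holds one token list per parallel stream, all of equal length.
    """
    if not streams:
        raise ValueError("at least one stream is required")
    num_streams = len(streams)
    num_frames = len(streams[0])
    if any(len(s) != num_frames for s in streams):
        raise ValueError("all streams must have the same length")

    # The last stream is delayed by (num_streams - 1) steps,
    # so the interleaved sequence is that much longer.
    total_steps = num_frames + num_streams - 1
    delayed = [[PAD] * total_steps for _ in range(num_streams)]
    for k, stream in enumerate(streams):
        for t, token in enumerate(stream):
            delayed[k][t + k] = token
    return delayed


def delay_deinterleave(delayed: List[List[Optional[int]]]) -> List[List[int]]:
    """Undo delay_interleave by slicing each stream back to its own frames."""
    num_streams = len(delayed)
    num_frames = len(delayed[0]) - num_streams + 1
    return [delayed[k][k:k + num_frames] for k in range(num_streams)]


if __name__ == "__main__":
    streams = [[1, 2, 3], [4, 5, 6]]
    delayed = delay_interleave(streams)
    print(delayed)
    print(delay_deinterleave(delayed))
```

With two streams of three tokens each, the second stream starts one step later than the first, and de-interleaving recovers the original streams unchanged.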



Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Codec-Robust Attacks on Audio LLMs
cs.SD 2026-05 unverdicted novelty 7.0

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
Steering Autoregressive Music Generation with Recursive Feature Machines
cs.LG 2025-10 unverdicted novelty 7.0

MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models
cs.SD 2026-07 conditional novelty 6.0

A text-to-procedural-audio system using LLMs to emit controllable categorical configurations, with live crossfading generator and three interchangeable backends for uninterrupted performance.
SPADE: Split-and-Delay Embeddings for Autoregressive High-Granularity Calorimeter Simulation
physics.ins-det 2026-06 unverdicted novelty 6.0

SPADE is a split-and-delay embedding technique for multi-feature autoregressive transformers that achieves competitive performance on high-granularity calorimeter shower simulation.
Codec-Robust Attacks on Audio LLMs
cs.SD 2026-05 unverdicted novelty 6.0

CodecAttack optimizes perturbations in neural audio codec latent space to reach 85.5% average target-substring ASR on compressed Opus audio while waveform baselines stay below 26%.
Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music
cs.SD 2026-05 unverdicted novelty 6.0

Introduces the first large-scale Persian music dataset and shows fine-tuned MusicGen produces compositions more aligned with Persian stylistic conventions via tag-based evaluation.
Step-Audio 2 Technical Report
cs.CL 2025-07 unverdicted novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...
Not that Groove: Zero-Shot Symbolic Music Editing
cs.SD 2025-05 unverdicted novelty 6.0

The work formalizes zero-shot symbolic drum editing as LLM reasoning over a drumroll grid notation, evaluates it on a new benchmark with automated symbolic unit tests, and reports up to 68% success across eight models.
VoxCPM2 Technical Report
cs.SD 2026-06 unverdicted novelty 5.0

VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.
Woosh: A Sound Effects Foundation Model
cs.SD 2026-04 accept novelty 5.0

Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
cs.AI 2026-03 unverdicted novelty 5.0

Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model
cs.SD 2026-05 unverdicted novelty 4.0

The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.