pith. sign in

hub Canonical reference

MusicLM: Generating Music From Text

Canonical reference. 73% of citing Pith papers cite this work as background.

55 Pith papers citing it
Background 73% of classified citations
abstract

We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.

hub tools

citation-role summary

background 10 dataset 4 method 1

citation-polarity summary

representative citing papers

MusicDET: Zero-Shot AI-Generated Music Detection

cs.SD · 2026-05-18 · unverdicted · novelty 7.0

MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.

Latent Fourier Transform

cs.SD · 2026-04-20 · unverdicted · novelty 7.0

LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.

DASB - Discrete Audio and Speech Benchmark

cs.SD · 2024-06-20 · unverdicted · novelty 7.0

DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.

ARIA: A Diagnostic Framework for Music Training Data Attribution

cs.SD · 2026-05-15 · unverdicted · novelty 6.0

ARIA decomposes music training data attribution into musical aspects and supplies reliability diagnostics from similarity metrics and score matrix analysis, with validation on symbolic models using counterfactual retraining.

Communicating Sound Through Natural Language

cs.LG · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

Lexical acoustic coding lets LLMs transmit audio waveforms as editable natural-language sentences that another LLM can parse and reconstruct into sound.

citing papers explorer

Showing 50 of 55 citing papers.