pith. machine review for the scientific record.

arxiv: 1810.12247 · v5 · submitted 2018-10-29 · 💻 cs.SD · cs.LG · eess.AS · stat.ML

Recognition: unknown

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Authors on Pith: no claims yet
classification: 💻 cs.SD · cs.LG · eess.AS · stat.ML
keywords: audio, dataset, music, musical, maestro, modeling, models, networks
read the original abstract

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
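The timescale range quoted in the abstract can be checked with simple arithmetic: 100 s / 0.1 ms = 10^6, i.e. six powers of ten. The minimal Python sketch below does that calculation and also illustrates what a note-event intermediate representation might look like. `NoteEvent` and `active_pitches` are hypothetical names chosen for illustration; they are not the MAESTRO file format or the authors' code.

```python
import math
from dataclasses import dataclass

# Timescales quoted in the abstract, in seconds.
FINEST_S = 0.1e-3    # ~0.1 ms
COARSEST_S = 100.0   # ~100 s


def orders_of_magnitude(fine_s: float, coarse_s: float) -> float:
    """Return how many powers of ten separate two timescales."""
    return math.log10(coarse_s / fine_s)


@dataclass(frozen=True)
class NoteEvent:
    """Hypothetical note record; field names are illustrative, not MAESTRO's schema."""
    pitch: int       # MIDI note number (21..108 on an 88-key piano)
    velocity: int    # MIDI velocity (1..127)
    onset_s: float   # key-press time in seconds
    offset_s: float  # key-release time in seconds


def active_pitches(notes: list[NoteEvent], t_s: float) -> set[int]:
    """Return the pitches sounding at time t_s (onset inclusive, offset exclusive)."""
    return {n.pitch for n in notes if n.onset_s <= t_s < n.offset_s}


if __name__ == "__main__":
    print(round(orders_of_magnitude(FINEST_S, COARSEST_S), 6))  # 6.0
    chord = [
        NoteEvent(60, 80, 0.0, 1.0),  # C4
        NoteEvent(64, 80, 0.0, 1.0),  # E4
        NoteEvent(67, 80, 0.5, 1.5),  # G4, entering half a second later
    ]
    print(sorted(active_pitches(chord, 0.75)))  # [60, 64, 67]
```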

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

    cs.SD · 2026-04 · unverdicted · novelty 7.0

    ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.

  2. Latent Fourier Transform

    cs.SD · 2026-04 · unverdicted · novelty 7.0

    LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
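The LatentFT summary mentions frequency masking in a latent space. As a rough, generic illustration only (not LatentFT's implementation, whose details are not given on this page), the NumPy sketch below keeps a chosen band of temporal frequencies in a latent sequence and discards the rest. The function name `band_mask_latents`, the `(num_frames, latent_dim)` layout, and the fixed latent frame rate are all assumptions.

```python
import numpy as np


def band_mask_latents(z: np.ndarray, frame_rate_hz: float,
                      low_hz: float, high_hz: float) -> np.ndarray:
    """Keep only temporal-frequency content of z within [low_hz, high_hz].

    Assumed layout: z has shape (num_frames, latent_dim) and is sampled at
    frame_rate_hz. The FFT runs along the time axis, so each latent channel
    is filtered independently.
    """
    num_frames = z.shape[0]
    spectrum = np.fft.rfft(z, axis=0)                       # (num_bins, latent_dim)
    freqs = np.fft.rfftfreq(num_frames, d=1.0 / frame_rate_hz)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[~keep, :] = 0.0                                # zero bins outside the band
    return np.fft.irfft(spectrum, n=num_frames, axis=0)     # back to the time domain


if __name__ == "__main__":
    t = np.arange(100) / 10.0                               # 10 s of latents at 10 Hz
    z = np.stack([1.0 + np.sin(2 * np.pi * t)], axis=1)     # slow offset + 1 Hz component
    slow = band_mask_latents(z, 10.0, 0.0, 0.5)             # keeps only the offset
    fast = band_mask_latents(z, 10.0, 0.5, 5.0)             # keeps only the 1 Hz part
    print(slow.shape, fast.shape)
```

Masking low bins isolates slowly varying structure and masking high bins isolates fast variation, which is the general sense in which a Fourier view of latents can separate timescales.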