LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

Chris Donahue; Garrison W. Cottrell; Huanru Henry Mao; Julian McAuley; Yiting Ethan Li

arxiv: 1907.04868 · v1 · pith:OUZFZGLHnew · submitted 2019-07-10 · 💻 cs.SD · cs.LG · cs.MM · eess.AS · stat.ML

Pith reviewed 2026-05-24 23:14 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · cs.MM · eess.AS · stat.ML

keywords music generation · transformer · pre-training · transfer learning · multi-instrumental · NES-MDB · Lakh MIDI · symbolic music

The pith

Pre-training a Transformer on diverse MIDI data improves generation of four-instrument NES music scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper adapts the Transformer model to generate music scores for four instruments using the NES-MDB dataset of early video game sounds. It first trains the model on the much larger but varied Lakh MIDI collection and then fine-tunes it on the NES data. The procedure produces better measured performance and better-sounding results than training only on the NES set, even though the two collections differ in instruments, sound production, and musical style. A sympathetic reader would care because music generation models need large amounts of data to learn long-term structure, and this shows one workable way to borrow from bigger mismatched collections.

Core claim

The authors establish that pre-training the multi-instrument Transformer on the Lakh MIDI dataset followed by fine-tuning on the NES-MDB dataset yields improved quantitative and qualitative performance on generating four-instrument NES scores compared to training solely on NES-MDB, despite the differences in the two corpora.

What carries the argument

The cross-domain pre-training procedure that first trains on the large heterogeneous Lakh MIDI collection before fine-tuning on NES-MDB.

If this is right

The pre-trained model records higher quantitative metrics on the NES generation task than a model trained from scratch on NES-MDB alone.
Human listeners rate the outputs from the pre-trained model as higher quality than those from the non-pre-trained baseline.
The reported gains appear even though the pre-training corpus and target corpus differ in instrumentation, synthesis method, and musical style.
The same pre-training step can be reused to improve Transformer performance on other multi-instrument generation tasks that start with limited domain data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The patterns learned during pre-training on broad MIDI may include general rules for coordinating multiple voices that apply even to very different sound palettes.
The same two-stage approach could be tested on other small symbolic music collections to see whether volume of pre-training data routinely compensates for domain mismatch.
If the benefit scales with the size of the pre-training set, then further gains might come from using even larger heterogeneous music corpora before specializing.

Load-bearing premise

Pre-training on the heterogeneous Lakh MIDI data transfers useful features to the NES-MDB four-instrument domain even though the corpora differ in instrumentation, synthesis, and musical style.

What would settle it

Training the same Transformer only on NES-MDB and obtaining equal or better quantitative metrics plus listener ratings than the pre-trained version would show that the transfer step adds no benefit.

Original abstract

We are interested in the task of generating multi-instrumental music scores. The Transformer architecture has recently shown great promise for the task of piano score generation; here we adapt it to the multi-instrumental setting. Transformers are complex, high-dimensional language models which are capable of capturing long-term structure in sequence data, but require large amounts of data to fit. Their success on piano score generation is partially explained by the large volumes of symbolic data readily available for that domain. We leverage the recently-introduced NES-MDB dataset of four-instrument scores from an early video game sound synthesis chip (the NES), which we find to be well-suited to training with the Transformer architecture. To further improve the performance of our model, we propose a pre-training technique to leverage the information in a large collection of heterogeneous music, namely the Lakh MIDI dataset. Despite differences between the two corpora, we find that this transfer learning procedure improves both quantitative and qualitative performance for our primary task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Cross-domain pre-training from Lakh MIDI to NES-MDB lifts multi-instrument Transformer performance, but the abstract leaves the size of the gain and the reason for it unclear.

The new piece is the specific transfer recipe: pre-train a multi-instrument Transformer on the large heterogeneous Lakh MIDI collection, then fine-tune on the smaller NES-MDB four-voice dataset. They adapt the architecture beyond the piano-only cases that were already in the literature and pick NES-MDB because its size and structure suit the model. That combination is a concrete step for anyone trying to train high-capacity sequence models on limited symbolic music data. The claim that the procedure improves both quantitative and qualitative results is stated plainly in the abstract. The paper does not overclaim broader impact.

The main soft spot is that the abstract supplies no numbers, no baselines, no error bars, and no training details, so the magnitude of the improvement and whether it survives standard controls cannot be checked from the given text. The stress-test concern about domain gap is reasonable on the surface: instrumentation, synthesis, and style differ between the corpora, and a generic regularization effect from extra pre-training steps could explain part of the lift without any music-specific transfer. If the full paper includes ablations that separate those factors, the result strengthens; otherwise the central claim stays harder to interpret.

This is useful reading for people already working on symbolic music generation with Transformers and small target corpora. It is not a foundational result but it is a practical incremental method worth trying. The work is coherent on its own terms and shows clear engagement with the data constraints in the field, so it deserves a serious referee who can check the tables and any controls.

Referee Report

2 major / 2 minor

Summary. The paper adapts the Transformer architecture to multi-instrumental symbolic music generation on the NES-MDB dataset of four-instrument chiptune scores. It introduces a cross-domain pre-training procedure on the larger, heterogeneous Lakh MIDI corpus and reports that this transfer learning step yields both quantitative and qualitative improvements on the target NES-MDB task despite differences in instrumentation, synthesis, and musical style.

Significance. If the empirical gains are shown to arise from transferable musical structure rather than generic optimization effects, the result would strengthen the case for cross-domain pre-training as a practical remedy for data scarcity in specialized symbolic music domains. The work also supplies a concrete demonstration of Transformer scaling to multi-track generation, which is a natural but non-trivial extension of prior piano-only results.

major comments (2)

[Abstract, §4] Abstract and §4: the central claim that pre-training on Lakh MIDI supplies useful cross-domain features is not isolated from generic pre-training effects. No control experiment (e.g., pre-training on shuffled or non-musical sequences of comparable size) is described, so the reported gains could be explained by longer effective optimization or regularization alone.
[§3, §5] §3 and §5: the quantitative evaluation lacks reported baseline numbers, dataset sizes, training details, error bars, or statistical tests. Without these, the magnitude and reliability of the claimed improvement cannot be assessed, undermining the assertion that transfer learning is responsible for the observed difference.

minor comments (2)

[§3] Notation for instrument tokens and event encoding is introduced without a consolidated table; a single reference table would improve readability.
[§5] The qualitative listening study description omits the number of raters, rating scale, and inter-rater agreement statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the changes planned for the revised manuscript.

Referee: [Abstract, §4] Abstract and §4: the central claim that pre-training on Lakh MIDI supplies useful cross-domain features is not isolated from generic pre-training effects. No control experiment (e.g., pre-training on shuffled or non-musical sequences of comparable size) is described, so the reported gains could be explained by longer effective optimization or regularization alone.

Authors: We agree this is a valid concern: without a non-musical control, the contribution of domain-specific structure versus generic optimization benefits cannot be fully isolated. In revision we will add an explicit discussion of this limitation in §4 and, resources permitting, a small-scale control experiment on shuffled token sequences of matched length. We will also clarify that the pre-training and fine-tuning schedules were matched in total steps to reduce the chance that the observed gains are due solely to extra optimization.

revision: partial

Referee: [§3, §5] §3 and §5: the quantitative evaluation lacks reported baseline numbers, dataset sizes, training details, error bars, or statistical tests. Without these, the magnitude and reliability of the claimed improvement cannot be assessed, undermining the assertion that transfer learning is responsible for the observed difference.

Authors: We will expand §3 and §5 in the revision to include explicit dataset sizes, full training hyper-parameters, tabulated baseline numbers, error bars from multiple random seeds, and statistical significance tests comparing the pre-trained and non-pre-trained models.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical transfer result on distinct corpora

The paper reports an empirical comparison of Transformer models trained from scratch on NES-MDB versus pre-trained on Lakh MIDI then fine-tuned, with gains measured on held-out target-domain data. No equations, parameter fits renamed as predictions, or self-citation chains appear in the derivation of the central claim. The result is obtained by direct experiment on two separate datasets and is therefore falsifiable without reducing to self-definition or imported uniqueness.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that Transformer sequence modeling applies to symbolic music and that cross-domain transfer is feasible despite corpus mismatch; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Transformer architecture is capable of capturing long-term structure in sequence data such as music scores
Invoked in the abstract to justify adapting the model from piano to multi-instrument setting.
domain assumption: Pre-training on heterogeneous music data can transfer useful representations to a target domain despite differences in instrumentation and style
This premise is required for the reported improvement to follow from the described procedure.

pith-pipeline@v0.9.0 · 5727 in / 1358 out tokens · 66428 ms · 2026-05-24T23:14:21.538808+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

[1]

LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

INTRODUCTION In this paper, we extend recent results for symbolic piano music generation [1] to the multi-instrumental setting. Both piano and multi-instrumental music are polyphonic, where multiple notes may be sounding at any given point in time. However, the generation of multi-instrumental music presents an additional challenge not present in the pi...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

Most early work involved manually encoding musical rules into generative systems or rearranging fragments of human-composed music; see [8] for an extensive overview

RELATED WORK Music generation has been an active area of research for decades. Most early work involved manually encoding musical rules into generative systems or rearranging fragments of human-composed music; see [8] for an extensive overview. Recent research has favored machine learning systems which automatically extract patterns from corpora of hum...

work page
[3]

DATASETS AND TASK The NES Music Database (NES-MDB) [2] consists of approximately 46 hours of music composed for the sound chip on the Nintendo Entertainment System. This dataset is enticing for research in multi-instrumental music generation because (1) it is an unusually large corpus of music that was composed for a fixed ensemble, and (2) it is avai...

work page
[4]

We factorize the joint probability of a musical sequence consisting of N events (E1,

METHODOLOGY To model the event sequences outlined in the last section, we adopt a language modeling factorization. We factorize the joint probability of a musical sequence consisting of $N$ events $(E_1, \ldots, E_N)$ into a product of conditionals: $P(E_1) \cdot P(E_2 \mid E_1) \cdot \ldots \cdot P(E_N \mid E_1, \ldots, E_{N-1})$. (1) This factorization is convenient because it allows for ...

work page
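The chain-rule factorization quoted in the excerpt above can be illustrated with a short sketch. This is not the authors' code: the `CondProb` interface, the function names, and the uniform toy model are all hypothetical, chosen only to show how a sequence log-probability is accumulated one conditional at a time.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface (an assumption for illustration): a "model" is any
# function mapping (history, next_event) to P(next_event | history).
CondProb = Callable[[Sequence[int], int], float]


def sequence_log_prob(events: Sequence[int], cond_prob: CondProb) -> float:
    """Sum log P(E_i | E_1, ..., E_{i-1}) over the sequence (chain rule)."""
    total = 0.0
    for i, event in enumerate(events):
        # events[:i] is the history preceding the i-th event (empty for i = 0).
        total += math.log(cond_prob(events[:i], event))
    return total


def uniform_model(vocab_size: int) -> CondProb:
    """Toy stand-in: every event is equally likely, whatever the history."""
    return lambda history, event: 1.0 / vocab_size


# Three events under a uniform model over a 4-symbol vocabulary.
log_p = sequence_log_prob([3, 1, 2], uniform_model(4))
```

Working in log space keeps the product of many small conditionals numerically stable, which matters once sequences run to hundreds of events.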
[5]

We train the model on excerpts from the training data of 512 events; each excerpt represents around 9 seconds of music on average

EXPERIMENTS We first conduct an experiment to train Transformer-XL [28] on our event representation (Section 3.2) of NES-MDB. We train the model on excerpts from the training data of 512 events; each excerpt represents around 9 seconds of music on average. Because of the recurrent attention mechanism in Transformer-XL, the model effectively has acces...

work page
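The excerpt above says training uses 512-event excerpts but does not say how they are cut from each piece. A minimal sketch, assuming consecutive non-overlapping windows and a dropped short tail (both are assumptions, and `fixed_length_excerpts` is a hypothetical helper name):

```python
from typing import List, Sequence


def fixed_length_excerpts(events: Sequence[int], length: int = 512) -> List[List[int]]:
    """Split one event sequence into consecutive, non-overlapping excerpts.

    Assumption: a trailing remainder shorter than `length` is discarded.
    """
    last_start = len(events) - length
    return [list(events[i:i + length]) for i in range(0, last_start + 1, length)]


# A piece of 1300 events yields two full excerpts; the last 276 events are dropped.
excerpts = fixed_length_excerpts(list(range(1300)))
```

Random-offset or overlapping windows would be equally plausible readings of the source; only the excerpt length of 512 events comes from the text.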
[6]

(Standard) Transpose melodic voices by a random number of semitones between −6 and 5 (inclusive)

work page
[7]

(Standard) Adjust the speed of the piece by a random percentage between ±5%

work page
[8]

Half of the time, remove a random number of instruments from the ensemble (leaving at least one)

work page
[9]

Finally, we experimented with pre-training our model on the Lakh MIDI dataset mapped to the NES ensemble (Section 4.2.1)

Half of the time, shuffle the score-to-instrument alignment for the melodic instruments only (e.g., TR performs P2’s part). Finally, we experimented with pre-training our model on the Lakh MIDI dataset mapped to the NES ensemble (Section 4.2.1). To conduct this experiment, we first split the Lakh data into training and validation subsets. We then trained th...

work page
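The four augmentation steps listed in the entries above (transpose, speed change, instrument removal, part shuffling) can be sketched as follows. The score representation, the instrument names `P1`, `P2`, `TR`, `NO`, and the choice to remove between 1 and n−1 instruments are assumptions made for illustration; the source gives only the ranges and the 50% probabilities.

```python
import random
from typing import Dict, List, Tuple

Note = Tuple[float, float, int]   # (onset in seconds, duration in seconds, pitch)
Score = Dict[str, List[Note]]     # instrument name -> notes (assumed layout)
MELODIC = ("P1", "P2", "TR")      # assumed names; "NO" would be the noise channel


def augment(score: Score, rng: random.Random) -> Score:
    """Return an augmented copy of `score`; the input is left unchanged."""
    out = {name: list(notes) for name, notes in score.items()}

    # 1. Transpose melodic voices by a random number of semitones in [-6, 5].
    shift = rng.randint(-6, 5)
    for name in MELODIC:
        if name in out:
            out[name] = [(t, d, p + shift) for t, d, p in out[name]]

    # 2. Adjust the speed by a random percentage within +/-5%.
    speed = 1.0 + rng.uniform(-0.05, 0.05)
    out = {n: [(t / speed, d / speed, p) for t, d, p in notes]
           for n, notes in out.items()}

    # 3. Half of the time, remove some instruments, leaving at least one.
    if rng.random() < 0.5 and len(out) > 1:
        n_remove = rng.randint(1, len(out) - 1)
        for name in rng.sample(sorted(out), n_remove):
            del out[name]

    # 4. Half of the time, shuffle parts among the melodic instruments only.
    if rng.random() < 0.5:
        present = [n for n in MELODIC if n in out]
        parts = [out[n] for n in present]
        rng.shuffle(parts)
        for name, part in zip(present, parts):
            out[name] = part
    return out


example = {"P1": [(0.0, 0.5, 60)], "P2": [(0.0, 0.5, 64)],
           "TR": [(0.0, 1.0, 48)], "NO": [(0.5, 0.1, 3)]}
augmented = augment(example, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible per seed. A real pipeline would probably also clamp transposed pitches to the range the target chip can play, which this sketch omits.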
[10]

QUANTITATIVE ANALYSIS We report the perplexity (PPL) of each model on the test set in Table 2. Perplexity is calculated by first averaging the negative log-likelihood of each model across the test data, then exponentiating the average, i.e., $e^{\frac{1}{N} \sum_{i=1}^{N} -\log q_i}$, where $q_i$ is the likelihood assigned by a given model to the $i$-th event. A lower perplexity on...

work page
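The perplexity definition in the excerpt above (average the negative log-likelihood over the test events, then exponentiate) translates directly into a few lines. The function name and the toy likelihood values are illustrative assumptions, not figures from the paper.

```python
import math
from typing import Sequence


def perplexity(likelihoods: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over N events.

    `likelihoods[i]` plays the role of q_i, the probability a model assigned
    to the i-th event that actually occurred.
    """
    n = len(likelihoods)
    if n == 0:
        raise ValueError("need at least one event")
    mean_nll = sum(-math.log(q) for q in likelihoods) / n
    return math.exp(mean_nll)


# Sanity check on a toy case: a model that always assigns probability 1/4
# is exactly as uncertain as a uniform choice among 4 events.
ppl_uniform = perplexity([0.25] * 8)
```

Read this way, a perplexity near 2.5 on the test set means the model is, on average, about as uncertain as a uniform choice among 2.5 candidate events at each step.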
[11]

Turing test

USER STUDY While perplexity is a useful quantitative metric for model comparison, it is not necessarily correlated with human ... [Figure 3: NES-MDB test PPL (y-axis) against number of Lakh pre-training epochs (x-axis); values shown: 2.74, 2.55, 2.47, 2.46.] Figure 3. Measuring the performance improvement when doubling the amount of Lakh MIDI pre-training before fine-tuning Transformer-XL on NES...

work page
[12]

For example, LakhNES can be “primed” on human-composed material and then asked to continue the material, providing a method for composers to quickly expand on their ideas

PAIRING LAKHNES WITH HUMANS In addition to generating chiptunes from scratch, LakhNES can be used for a number of tasks to assist human composers. For example, LakhNES can be “primed” on human-composed material and then asked to continue the material, providing a method for composers to quickly expand on their ideas. Composers can also provide fixed rh...

work page
[13]

We developed an event-based representation suitable for this task

CONCLUSION In this paper we presented LakhNES, a method for learning to generate multi-instrumental music. We developed an event-based representation suitable for this task. Training powerful language models on this representation results in compelling multi-instrumental music generation. We show that we can further improve results both quantitati...

work page
[14]

This work was supported by UC San Diego’s Chancellor’s Research Excellence Scholarship program

ACKNOWLEDGEMENTS Thanks to Cheng-Zhi Anna Huang, Cheng-i Wang, and Jennifer Hsu for helpful discussions regarding this work. This work was supported by UC San Diego’s Chancellor’s Research Excellence Scholarship program. GPUs used in this research were donated by NVIDIA

work page
[15]

Music Transformer: Generating music with long-term structure

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music Transformer: Generating music with long-term structure. In Proc. ICLR, 2019

work page 2019
[16]

The NES Music Database: A multi-instrumental dataset with expressive performance attributes

Chris Donahue, Huanru Henry Mao, and Julian McAuley. The NES Music Database: A multi-instrumental dataset with expressive performance attributes. In Proc. ISMIR, 2018

work page 2018
[17]

Enabling factorized piano music modeling and generation with the MAESTRO dataset

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proc. ICLR, 2019

work page 2019
[18]

Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching

Colin Raffel. Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching. PhD thesis, Columbia University, 2016

work page 2016
[19]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, 2019

work page 2019
[21]

Transfer learning for music classification and regression tasks

Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Transfer learning for music classification and regression tasks. In Proc. ISMIR, 2017

work page 2017
[22]

Algorithmic composition: paradigms of automated music generation

Gerhard Nierhaus. Algorithmic composition: paradigms of automated music generation. Springer Science & Business Media, 2009

work page 2009
[23]

A connectionist approach to algorithmic composition

Peter M Todd. A connectionist approach to algorithmic composition. Computer Music Journal, 1989

work page 1989
[24]

Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing

Michael C Mozer. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science, 1994

work page 1994
[25]

Finding temporal structure in music: Blues improvisation with LSTM recurrent networks

Douglas Eck and Jürgen Schmidhuber. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Proc. Neural Networks for Signal Processing, 2002

work page 2002
[26]

Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proc. ICML, 2012

work page 2012
[27]

Generating polyphonic music using tied parallel networks

Daniel D Johnson. Generating polyphonic music using tied parallel networks. In Proc. International Conference on Evolutionary and Biologically Inspired Music and Art, 2017

work page 2017
[28]

MidiNet: A convolutional generative adversarial network for symbolic-domain music generation

Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Proc. ISMIR, 2017

work page 2017
[29]

Performance RNN: Generating music with expressive timing and dynamics

Ian Simon and Sageev Oore. Performance RNN: Generating music with expressive timing and dynamics. https://magenta.tensorflow.org/performance-rnn, 2017

work page 2017
[30]

DeepJ: Style-specific music generation

Huanru Henry Mao, Taylor Shin, and Garrison Cottrell. DeepJ: Style-specific music generation. In Proc. International Conference on Semantic Computing, 2018

work page 2018
[31]

Harmonising chorales by probabilistic inference

Moray Allan and Christopher Williams. Harmonising chorales by probabilistic inference. In Proc. NIPS, 2005

work page 2005
[32]

Counterpoint by convolution

Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. Counterpoint by convolution. In Proc. ISMIR, 2017

work page 2017
[33]

DeepBach: A steerable model for Bach chorales generation

Gaëtan Hadjeres and François Pachet. DeepBach: A steerable model for Bach chorales generation. In Proc. ICML, 2017

work page 2017
[34]

Part-invariant model for music generation and harmonization

Yujia Yan, Ethan Lustig, Joseph VanderStel, and Zhiyao Duan. Part-invariant model for music generation and harmonization. In Proc. ISMIR, 2018

work page 2018
[35]

Convolutional generative adversarial networks with binary neurons for polyphonic music generation

Hao-Wen Dong and Yi-Hsuan Yang. Convolutional generative adversarial networks with binary neurons for polyphonic music generation. In Proc. ISMIR, 2018

work page 2018
[36]

A hierarchical latent vector model for learning long-term structure in music

Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In Proc. ICML, 2018

work page 2018
[37]

MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment

Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Proc. AAAI, 2018

work page 2018
[38]

Adversarial audio synthesis

Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In Proc. ICLR, 2018

work page 2018
[39]

The challenge of realistic music generation: modelling raw audio at scale

Sander Dieleman, Aäron van den Oord, and Karen Simonyan. The challenge of realistic music generation: modelling raw audio at scale. In Proc. NeurIPS, 2018

work page 2018
[40]

Christine McLeavy Payne. MuseNet. https://openai.com/blog/musenet/, 2019

work page 2019
[41]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NIPS, 2017

work page 2017
[42]

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv:1901.02860, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[43]

HARMONET: A neural net for harmonizing chorales in the style of JS Bach

Hermann Hild, Johannes Feulner, and Wolfram Menzel. HARMONET: A neural net for harmonizing chorales in the style of JS Bach. In Proc. NIPS, 1992

work page 1992
[44]

A survey on transfer learning

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 2010

work page 2010
[45]

Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997

work page 1997
[46]

Hierarchical neural story generation

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proc. ACL, 2018

work page 2018

[1] [1]

LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

INTRODUCTION In this paper, we extend recent results for symbolic pi- ano music generation [1] to the multi-instrumental setting. Both piano and multi-instrumental music are polyphonic, where multiple notes may be sounding at any given point in time. However, the generation of multi-instrumental music presents an additional challenge not present in the pi...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[2] [2]

Most early work involved manually encoding musical rules into generative systems or rearranging frag- ments of human-composed music; see [8] for an extensive overview

RELA TED WORK Music generation has been an active area of research for decades. Most early work involved manually encoding musical rules into generative systems or rearranging frag- ments of human-composed music; see [8] for an extensive overview. Recent research has favored machine learning systems which automatically extract patterns from corpora of hum...

work page

[3] [3]

DA TASETS AND TASK The NES Music Database (NES-MDB) [2] consists of ap- proximately 46 hours of music composed for the sound chip on the Nintendo Entertainment System. This dataset is enticing for research in multi-instrumental music gener- ation because (1) it is an unusually large corpus of music that was composed for a ﬁxed ensemble, and (2) it is avai...

work page

[4] [4]

We factorize the joint probability of a musical sequence consisting ofN events (E1,

METHODOLOGY To model the event sequences outlined in the last section, we adopt a language modeling factorization. We factorize the joint probability of a musical sequence consisting ofN events (E1, . . . , EN ) into a product of conditionals: P (E1)· P (E2| E1)· . . .· P (EN| E1, . . . , EN −1). (1) This factorization is convenient because it allows for ...

work page

[5] [5]

We train the model on excerpts from the training data of 512 events; each excerpt represents around 9 sec- onds of music on average

EXPERIMENTS We ﬁrst conduct an experiment to train Transformer- XL [28] on our event representation (Section 3.2) of NES- MDB. We train the model on excerpts from the training data of 512 events; each excerpt represents around 9 sec- onds of music on average. Because of the recurrent atten- tion mechanism in Transformer-XL, the model effectively has acces...

work page

[6] [6]

(Standard) Transpose melodic voices by a random number of semitones between−6 and 5 (inclusive)

work page

[7] [7]

(Standard) Adjust the speed of the piece by a random percentage between±5%

work page

[8] [8]

Half of the time, remove a random number of instru- ments from the ensemble (leaving at least one)

work page

[9] [9]

Finally, we experimented with pre-training our model on the Lakh MIDI dataset mapped to the NES ensemble (Section 4.2.1)

Half of the time, shufﬂe the score-to-instrument alignment for the melodic instruments only (e.g., TR performs P2’s part). Finally, we experimented with pre-training our model on the Lakh MIDI dataset mapped to the NES ensemble (Section 4.2.1). To conduct this experiment, we ﬁrst split the Lakh data into training and validation subsets. We then trained th...

work page

[10] [10]

QUANTITA TIVE ANALYSIS We report the perplexity (PPL) of each model on the test set in Table 2. Perplexity is calculated by ﬁrst averaging the negative log-likelihood of each model across the test data, then exponentiating the average, i.e.,e 1 N ∑N i=1 − log qi, where qi is the likelihood assigned by a given model to the i-th event. A lower perplexity on...

work page

[11] [11]

Turing test

USER STUDY While perplexity is a useful quantitative metric for model comparison, it is not necessarily correlated with human 0 1 2 3 4 Number of Lakh pre-training epochs 2.5 2.6 2.7NES-MDB test PPL 2.74 2.55 2.47 2.46 Figure 3. Measuring the performance improvement when doubling the amount of Lakh MIDI pre-training before ﬁne-tuning Transformer-XL on NES...

work page

[12] [12]

For example, LakhNES can be “primed” on human-composed material and then asked to continue the material, providing a method for composers to quickly ex- pand on their ideas

PAIRING LAKHNES WITH HUMANS In addition to generating chiptunes from scratch, LakhNES can be used for a number of tasks to assist human com- posers. For example, LakhNES can be “primed” on human-composed material and then asked to continue the material, providing a method for composers to quickly ex- pand on their ideas. Composers can also provide ﬁxed rh...

work page

[13] [13]

We developed an event-based representation suitable for this task

CONCLUSION In this paper we presented LakhNES, a method for learn- ing to generate multi-instrumental music. We developed an event-based representation suitable for this task. Train- ing powerful language models on this representation re- sults in compelling multi-instrumental music generation. We show that we can further improve results both quanti- tati...

work page

[14] [14]

This work was supported by UC San Diego’s Chancellors Research Excellence Scholarship program

ACKNOWLEDGEMENTS Thanks to Cheng-Zhi Anna Huang, Cheng-i Wang, and Jennifer Hsu for helpful discussions regarding this work. This work was supported by UC San Diego’s Chancellors Research Excellence Scholarship program. GPUs used in this research were donated by NVIDIA

work page

[15] [15]

Dai, Matthew D

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Mon- ica Dinculescu, and Douglas Eck. Music Transformer: Generating music with long-term structure. In Proc . ICLR, 2019

work page 2019

[16] [16]

The NES Music Database: A multi- instrumental dataset with expressive performance at- tributes

Chris Donahue, Huanru Henry Mao, and Julian McAuley. The NES Music Database: A multi- instrumental dataset with expressive performance at- tributes. In Proc. ISMIR, 2018

work page 2018

[17] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proc. ICLR, 2019.

[18] Colin Raffel. Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching. PhD thesis, Columbia University, 2016.

[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.

[20] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, 2019.

[21] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Transfer learning for music classification and regression tasks. In Proc. ISMIR, 2017.

[22] Gerhard Nierhaus. Algorithmic composition: paradigms of automated music generation. Springer Science & Business Media, 2009.

[23] Peter M Todd. A connectionist approach to algorithmic composition. Computer Music Journal, 1989.

[24] Michael C Mozer. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science, 1994.

[25] Douglas Eck and Jürgen Schmidhuber. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Proc. Neural Networks for Signal Processing, 2002.

[26] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proc. ICML, 2012.

[27] Daniel D Johnson. Generating polyphonic music using tied parallel networks. In Proc. International Conference on Evolutionary and Biologically Inspired Music and Art, 2017.

[28] Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Proc. ISMIR, 2017.

[29] Ian Simon and Sageev Oore. Performance RNN: Generating music with expressive timing and dynamics. https://magenta.tensorflow.org/performance-rnn, 2017.

[30] Huanru Henry Mao, Taylor Shin, and Garrison Cottrell. DeepJ: Style-specific music generation. In Proc. International Conference on Semantic Computing, 2018.

[31] Moray Allan and Christopher Williams. Harmonising chorales by probabilistic inference. In Proc. NIPS, 2005.

[32] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. Counterpoint by convolution. In Proc. ISMIR, 2017.

[33] Gaëtan Hadjeres and François Pachet. DeepBach: A steerable model for Bach chorales generation. In Proc. ICML, 2017.

[34] Yujia Yan, Ethan Lustig, Joseph VanderStel, and Zhiyao Duan. Part-invariant model for music generation and harmonization. In Proc. ISMIR, 2018.

[35] Hao-Wen Dong and Yi-Hsuan Yang. Convolutional generative adversarial networks with binary neurons for polyphonic music generation. In Proc. ISMIR, 2018.

[36] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In Proc. ICML, 2018.

[37] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Proc. AAAI, 2018.

[38] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In Proc. ICLR, 2018.

[39] Sander Dieleman, Aäron van den Oord, and Karen Simonyan. The challenge of realistic music generation: modelling raw audio at scale. In Proc. NeurIPS, 2018.

[40] Christine McLeavy Payne. MuseNet. https://openai.com/blog/musenet/, 2019.

[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NIPS, 2017.

[42] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv:1901.02860, 2019.

[43] Hermann Hild, Johannes Feulner, and Wolfram Menzel. HARMONET: A neural net for harmonizing chorales in the style of JS Bach. In Proc. NIPS, 1992.

[44] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2010.

[45] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[46] Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proc. ACL, 2018.