arxiv: 1907.04868 · v1 · pith:OUZFZGLHnew · submitted 2019-07-10 · 💻 cs.SD · cs.LG · cs.MM · eess.AS · stat.ML

LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

Pith reviewed 2026-05-24 23:14 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · cs.MM · eess.AS · stat.ML
keywords music generation · transformer · pre-training · transfer learning · multi-instrumental · NES-MDB · Lakh MIDI · symbolic music

The pith

Pre-training a Transformer on diverse MIDI data improves generation of four-instrument NES music scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper adapts the Transformer model to generate music scores for four instruments using the NES-MDB dataset of early video game sounds. It first trains the model on the much larger but varied Lakh MIDI collection and then fine-tunes it on the NES data. The procedure produces better measured performance and better-sounding results than training only on the NES set, even though the two collections differ in instruments, sound production, and musical style. A sympathetic reader would care because music generation models need large amounts of data to learn long-term structure, and this shows one workable way to borrow from bigger mismatched collections.

Core claim

The authors establish that pre-training the multi-instrument Transformer on the Lakh MIDI dataset followed by fine-tuning on the NES-MDB dataset yields improved quantitative and qualitative performance on generating four-instrument NES scores compared to training solely on NES-MDB, despite the differences in the two corpora.

What carries the argument

The cross-domain pre-training procedure that first trains on the large heterogeneous Lakh MIDI collection before fine-tuning on NES-MDB.

If this is right

  • The pre-trained model records higher quantitative metrics on the NES generation task than a model trained from scratch on NES-MDB alone.
  • Human listeners rate the outputs from the pre-trained model as higher quality than those from the non-pre-trained baseline.
  • The reported gains appear even though the pre-training corpus and target corpus differ in instrumentation, synthesis method, and musical style.
  • The same pre-training step can be reused to improve Transformer performance on other multi-instrument generation tasks that start with limited domain data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The patterns learned during pre-training on broad MIDI may include general rules for coordinating multiple voices that apply even to very different sound palettes.
  • The same two-stage approach could be tested on other small symbolic music collections to see whether volume of pre-training data routinely compensates for domain mismatch.
  • If the benefit scales with the size of the pre-training set, then further gains might come from using even larger heterogeneous music corpora before specializing.

Load-bearing premise

Pre-training on the heterogeneous Lakh MIDI data transfers useful features to the NES-MDB four-instrument domain even though the corpora differ in instrumentation, synthesis, and musical style.

What would settle it

Training the same Transformer only on NES-MDB and obtaining equal or better quantitative metrics plus listener ratings than the pre-trained version would show that the transfer step adds no benefit.

Original abstract

We are interested in the task of generating multi-instrumental music scores. The Transformer architecture has recently shown great promise for the task of piano score generation; here we adapt it to the multi-instrumental setting. Transformers are complex, high-dimensional language models which are capable of capturing long-term structure in sequence data, but require large amounts of data to fit. Their success on piano score generation is partially explained by the large volumes of symbolic data readily available for that domain. We leverage the recently-introduced NES-MDB dataset of four-instrument scores from an early video game sound synthesis chip (the NES), which we find to be well-suited to training with the Transformer architecture. To further improve the performance of our model, we propose a pre-training technique to leverage the information in a large collection of heterogeneous music, namely the Lakh MIDI dataset. Despite differences between the two corpora, we find that this transfer learning procedure improves both quantitative and qualitative performance for our primary task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper adapts the Transformer architecture to multi-instrumental symbolic music generation on the NES-MDB dataset of four-instrument chiptune scores. It introduces a cross-domain pre-training procedure on the larger, heterogeneous Lakh MIDI corpus and reports that this transfer learning step yields both quantitative and qualitative improvements on the target NES-MDB task despite differences in instrumentation, synthesis, and musical style.

Significance. If the empirical gains are shown to arise from transferable musical structure rather than generic optimization effects, the result would strengthen the case for cross-domain pre-training as a practical remedy for data scarcity in specialized symbolic music domains. The work also supplies a concrete demonstration of Transformer scaling to multi-track generation, which is a natural but non-trivial extension of prior piano-only results.

major comments (2)
  1. [Abstract, §4] Abstract and §4: the central claim that pre-training on Lakh MIDI supplies useful cross-domain features is not isolated from generic pre-training effects. No control experiment (e.g., pre-training on shuffled or non-musical sequences of comparable size) is described, so the reported gains could be explained by longer effective optimization or regularization alone.
  2. [§3, §5] §3 and §5: the quantitative evaluation lacks reported baseline numbers, dataset sizes, training details, error bars, or statistical tests. Without these, the magnitude and reliability of the claimed improvement cannot be assessed, undermining the assertion that transfer learning is responsible for the observed difference.
minor comments (2)
  1. [§3] Notation for instrument tokens and event encoding is introduced without a consolidated table; a single reference table would improve readability.
  2. [§5] The qualitative listening study description omits the number of raters, rating scale, and inter-rater agreement statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4: the central claim that pre-training on Lakh MIDI supplies useful cross-domain features is not isolated from generic pre-training effects. No control experiment (e.g., pre-training on shuffled or non-musical sequences of comparable size) is described, so the reported gains could be explained by longer effective optimization or regularization alone.

    Authors: We agree this is a valid concern: without a non-musical control, the contribution of domain-specific structure versus generic optimization benefits cannot be fully isolated. In revision we will add an explicit discussion of this limitation in §4 and, resources permitting, a small-scale control experiment on shuffled token sequences of matched length. We will also clarify that the pre-training and fine-tuning schedules were matched in total steps to reduce the chance that the observed gains are due solely to extra optimization. revision: partial

  2. Referee: [§3, §5] §3 and §5: the quantitative evaluation lacks reported baseline numbers, dataset sizes, training details, error bars, or statistical tests. Without these, the magnitude and reliability of the claimed improvement cannot be assessed, undermining the assertion that transfer learning is responsible for the observed difference.

    Authors: We will expand §3 and §5 in the revision to include explicit dataset sizes, full training hyper-parameters, tabulated baseline numbers, error bars from multiple random seeds, and statistical significance tests comparing the pre-trained and non-pre-trained models. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical transfer result on distinct corpora

Full rationale

The paper reports an empirical comparison of Transformer models trained from scratch on NES-MDB versus pre-trained on Lakh MIDI then fine-tuned, with gains measured on held-out target-domain data. No equations, parameter fits renamed as predictions, or self-citation chains appear in the derivation of the central claim. The result is obtained by direct experiment on two separate datasets and is therefore falsifiable without reducing to self-definition or imported uniqueness.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that Transformer sequence modeling applies to symbolic music and that cross-domain transfer is feasible despite corpus mismatch; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Transformer architecture is capable of capturing long-term structure in sequence data such as music scores
    Invoked in the abstract to justify adapting the model from piano to multi-instrument setting.
  • domain assumption Pre-training on heterogeneous music data can transfer useful representations to a target domain despite differences in instrumentation and style
    This premise is required for the reported improvement to follow from the described procedure.

pith-pipeline@v0.9.0 · 5727 in / 1358 out tokens · 66428 ms · 2026-05-24T23:14:21.538808+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

  1. [1]

    LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

    INTRODUCTION In this paper, we extend recent results for symbolic piano music generation [1] to the multi-instrumental setting. Both piano and multi-instrumental music are polyphonic, where multiple notes may be sounding at any given point in time. However, the generation of multi-instrumental music presents an additional challenge not present in the pi...

  2. [2]

    Most early work involved manually encoding musical rules into generative systems or rearranging fragments of human-composed music; see [8] for an extensive overview

    RELATED WORK Music generation has been an active area of research for decades. Most early work involved manually encoding musical rules into generative systems or rearranging fragments of human-composed music; see [8] for an extensive overview. Recent research has favored machine learning systems which automatically extract patterns from corpora of hum...

  3. [3]

    DATASETS AND TASK The NES Music Database (NES-MDB) [2] consists of approximately 46 hours of music composed for the sound chip on the Nintendo Entertainment System. This dataset is enticing for research in multi-instrumental music generation because (1) it is an unusually large corpus of music that was composed for a fixed ensemble, and (2) it is avai...

  4. [4]

    We factorize the joint probability of a musical sequence consisting of N events (E1,

    METHODOLOGY To model the event sequences outlined in the last section, we adopt a language modeling factorization. We factorize the joint probability of a musical sequence consisting of $N$ events $(E_1, \ldots, E_N)$ into a product of conditionals: $P(E_1) \cdot P(E_2 \mid E_1) \cdot \ldots \cdot P(E_N \mid E_1, \ldots, E_{N-1})$ (1). This factorization is convenient because it allows for ...
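The factorization in Eq. (1) can be illustrated with a minimal sketch. The conditional model below (`toy_cond_prob`, a two-symbol distribution that favors repeating the previous event) is purely hypothetical and stands in for a trained language model; it is not the paper's Transformer-XL. The sketch only shows how a joint log-probability is accumulated from per-event conditionals.

```python
import math
from typing import Callable, Sequence


def sequence_log_prob(
    events: Sequence[str],
    cond_prob: Callable[[str, Sequence[str]], float],
) -> float:
    """Return log P(E_1, ..., E_N) as the sum of log P(E_i | E_1, ..., E_{i-1})."""
    total = 0.0
    for i, event in enumerate(events):
        # events[:i] is the history E_1, ..., E_{i-1} (empty for the first event).
        total += math.log(cond_prob(event, events[:i]))
    return total


def toy_cond_prob(event: str, history: Sequence[str]) -> float:
    """Hypothetical model over {"A", "B"}: uniform start, then favors repetition."""
    if not history:
        return 0.5
    return 0.75 if event == history[-1] else 0.25


log_p = sequence_log_prob(["A", "A", "B"], toy_cond_prob)
# Joint probability is 0.5 * 0.75 * 0.25, i.e. approximately 0.09375.
print(math.exp(log_p))
```

Working in log space avoids numerical underflow on long sequences, which matters when excerpts contain hundreds of events.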

  5. [5]

    We train the model on excerpts from the training data of 512 events; each excerpt represents around 9 seconds of music on average

    EXPERIMENTS We first conduct an experiment to train Transformer-XL [28] on our event representation (Section 3.2) of NES-MDB. We train the model on excerpts from the training data of 512 events; each excerpt represents around 9 seconds of music on average. Because of the recurrent attention mechanism in Transformer-XL, the model effectively has acces...

  6. [6]

    (Standard) Transpose melodic voices by a random number of semitones between −6 and 5 (inclusive)

  7. [7]

    (Standard) Adjust the speed of the piece by a random percentage within ±5%

  8. [8]

    Half of the time, remove a random number of instruments from the ensemble (leaving at least one)

  9. [9]

    Finally, we experimented with pre-training our model on the Lakh MIDI dataset mapped to the NES ensemble (Section 4.2.1)

    Half of the time, shuffle the score-to-instrument alignment for the melodic instruments only (e.g., TR performs P2’s part). Finally, we experimented with pre-training our model on the Lakh MIDI dataset mapped to the NES ensemble (Section 4.2.1). To conduct this experiment, we first split the Lakh data into training and validation subsets. We then trained th...
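The four augmentation steps listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the score layout (a dict from instrument name to a list of `(onset_seconds, pitch)` tuples), the function name `augment`, and the instrument labels `P1` and `NO` are hypothetical (the extract itself only names `TR` and `P2`). The sketch also assumes that "a random number" of removed instruments means at least one, and it does not clip transposed pitches to any instrument range.

```python
import random

# Assumed instrument labels; only "TR" and "P2" appear in the extracted text.
MELODIC = ("P1", "P2", "TR")


def augment(score, rng):
    """Return an augmented copy of `score` (instrument -> [(onset_seconds, pitch)])."""
    shift = rng.randint(-6, 5)                # semitones, both ends inclusive
    speed = 1.0 + rng.uniform(-0.05, 0.05)    # speed change within +/-5%

    out = {}
    for name, notes in score.items():
        # Transpose melodic voices only; rescale onsets for every voice.
        out[name] = [
            (onset / speed, pitch + shift if name in MELODIC else pitch)
            for onset, pitch in notes
        ]

    # Half of the time, drop some instruments but always keep at least one.
    if rng.random() < 0.5 and len(out) > 1:
        n_remove = rng.randint(1, len(out) - 1)
        for name in rng.sample(sorted(out), n_remove):
            del out[name]

    # Half of the time, permute which melodic instrument plays which part.
    if rng.random() < 0.5:
        present = [name for name in MELODIC if name in out]
        parts = [out[name] for name in present]
        rng.shuffle(parts)
        for name, part in zip(present, parts):
            out[name] = part

    return out


example = {"P1": [(0.0, 60), (1.0, 64)], "P2": [(0.0, 55)], "TR": [(0.5, 43)], "NO": [(0.0, 3)]}
print(augment(example, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible per excerpt, which is one reasonable design choice among several.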

  10. [10]

    QUANTITATIVE ANALYSIS We report the perplexity (PPL) of each model on the test set in Table 2. Perplexity is calculated by first averaging the negative log-likelihood of each model across the test data, then exponentiating the average, i.e., $e^{\frac{1}{N}\sum_{i=1}^{N} -\log q_i}$, where $q_i$ is the likelihood assigned by a given model to the $i$-th event. A lower perplexity on...
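The perplexity definition above translates directly into a few lines of code. This is a minimal sketch: the function name `perplexity` and the example likelihoods are illustrative and do not come from the paper's evaluation code or results.

```python
import math
from typing import Sequence


def perplexity(likelihoods: Sequence[float]) -> float:
    """Exponentiate the mean negative log-likelihood over all events."""
    if not likelihoods:
        raise ValueError("need at least one event likelihood")
    mean_nll = sum(-math.log(q) for q in likelihoods) / len(likelihoods)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every event behaves like a
# uniform guess among four options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25]))
```

Read this way, a perplexity of about 2.5 corresponds to the model being, on average, as uncertain as a uniform choice among roughly 2.5 candidate events.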

  11. [11]

    Turing test

    USER STUDY While perplexity is a useful quantitative metric for model comparison, it is not necessarily correlated with human ...

    Figure 3 (x-axis: number of Lakh pre-training epochs; y-axis: NES-MDB test PPL; plotted values 2.74, 2.55, 2.47, 2.46). Measuring the performance improvement when doubling the amount of Lakh MIDI pre-training before fine-tuning Transformer-XL on NES...

  12. [12]

    For example, LakhNES can be “primed” on human-composed material and then asked to continue the material, providing a method for composers to quickly expand on their ideas

    PAIRING LAKHNES WITH HUMANS In addition to generating chiptunes from scratch, LakhNES can be used for a number of tasks to assist human composers. For example, LakhNES can be “primed” on human-composed material and then asked to continue the material, providing a method for composers to quickly expand on their ideas. Composers can also provide fixed rh...

  13. [13]

    We developed an event-based representation suitable for this task

    CONCLUSION In this paper we presented LakhNES, a method for learning to generate multi-instrumental music. We developed an event-based representation suitable for this task. Training powerful language models on this representation results in compelling multi-instrumental music generation. We show that we can further improve results both quantitati...

  14. [14]

    This work was supported by UC San Diego’s Chancellors Research Excellence Scholarship program

    ACKNOWLEDGEMENTS Thanks to Cheng-Zhi Anna Huang, Cheng-i Wang, and Jennifer Hsu for helpful discussions regarding this work. This work was supported by UC San Diego’s Chancellors Research Excellence Scholarship program. GPUs used in this research were donated by NVIDIA

  15. [15]

    Music Transformer: Generating music with long-term structure

    Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music Transformer: Generating music with long-term structure. In Proc. ICLR, 2019

  16. [16]

    The NES Music Database: A multi-instrumental dataset with expressive performance attributes

    Chris Donahue, Huanru Henry Mao, and Julian McAuley. The NES Music Database: A multi-instrumental dataset with expressive performance attributes. In Proc. ISMIR, 2018

  17. [17]

    Enabling factorized piano music modeling and generation with the MAESTRO dataset

    Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proc. ICLR, 2019

  18. [18]

    Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching

    Colin Raffel. Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching. PhD thesis, Columbia University, 2016

  19. [19]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018

  20. [20]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, 2019

  21. [21]

    Transfer learning for music classification and regression tasks

    Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Transfer learning for music classification and regression tasks. In Proc. ISMIR, 2017

  22. [22]

    Algorithmic composition: paradigms of automated music generation

    Gerhard Nierhaus. Algorithmic composition: paradigms of automated music generation. Springer Science & Business Media, 2009

  23. [23]

    A connectionist approach to algorithmic composition

    Peter M Todd. A connectionist approach to algorithmic composition. Computer Music Journal, 1989

  24. [24]

    Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing

    Michael C Mozer. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science, 1994

  25. [25]

    Finding temporal structure in music: Blues improvisation with LSTM recurrent networks

    Douglas Eck and Jürgen Schmidhuber. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Proc. Neural Networks for Signal Processing, 2002

  26. [26]

    Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription

    Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proc. ICML, 2012

  27. [27]

    Generating polyphonic music using tied parallel networks

    Daniel D Johnson. Generating polyphonic music using tied parallel networks. In Proc. International Conference on Evolutionary and Biologically Inspired Music and Art, 2017

  28. [28]

    MidiNet: A convolutional generative adversarial network for symbolic-domain music generation

    Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Proc. ISMIR, 2017

  29. [29]

    Performance RNN: Generating music with expressive timing and dynamics

    Ian Simon and Sageev Oore. Performance RNN: Generating music with expressive timing and dynamics. https://magenta.tensorflow.org/performance-rnn, 2017

  30. [30]

    DeepJ: Style-specific music generation

    Huanru Henry Mao, Taylor Shin, and Garrison Cottrell. DeepJ: Style-specific music generation. In Proc. International Conference on Semantic Computing, 2018

  31. [31]

    Harmonising chorales by probabilistic inference

    Moray Allan and Christopher Williams. Harmonising chorales by probabilistic inference. In Proc. NIPS, 2005

  32. [32]

    Counterpoint by convolution

    Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. Counterpoint by convolution. In Proc. ISMIR, 2017

  33. [33]

    DeepBach: A steerable model for Bach chorales generation

    Gaëtan Hadjeres and François Pachet. DeepBach: A steerable model for Bach chorales generation. In Proc. ICML, 2017

  34. [34]

    Part-invariant model for music generation and harmonization

    Yujia Yan, Ethan Lustig, Joseph VanderStel, and Zhiyao Duan. Part-invariant model for music generation and harmonization. In Proc. ISMIR, 2018

  35. [35]

    Convolutional generative adversarial networks with binary neurons for polyphonic music generation

    Hao-Wen Dong and Yi-Hsuan Yang. Convolutional generative adversarial networks with binary neurons for polyphonic music generation. In Proc. ISMIR, 2018

  36. [36]

    A hierarchical latent vector model for learning long-term structure in music

    Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In Proc. ICML, 2018

  37. [37]

    MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment

    Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Proc. AAAI, 2018

  38. [38]

    Adversarial audio synthesis

    Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In Proc. ICLR, 2018

  39. [39]

    The challenge of realistic music generation: modelling raw audio at scale

    Sander Dieleman, Aäron van den Oord, and Karen Simonyan. The challenge of realistic music generation: modelling raw audio at scale. In Proc. NeurIPS, 2018

  40. [40]

    Christine McLeavy Payne. MuseNet. https://openai.com/blog/musenet/, 2019

  41. [41]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NIPS, 2017

  42. [42]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv:1901.02860, 2019

  43. [43]

    HARMONET: A neural net for harmonizing chorales in the style of JS Bach

    Hermann Hild, Johannes Feulner, and Wolfram Menzel. HARMONET: A neural net for harmonizing chorales in the style of JS Bach. In Proc. NIPS, 1992

  44. [44]

    A survey on transfer learning

    Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 2010

  45. [45]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997

  46. [46]

    Hierarchical neural story generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proc. ACL, 2018