Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias

Joonwon Seo

arXiv: 2601.03612 · v7 · submitted 2026-01-07 · 💻 cs.LG · cs.SD · eess.AS

Pith reviewed 2026-05-16 17:11 UTC · model grok-4.3

keywords: polyphonic music generation · structural inductive bias · Smart Embedding · parameter reduction · information theory · Rademacher complexity · Beethoven sonatas · AI music models

The pith

The Smart Embedding architecture reduces parameters by 48.30 percent in polyphonic music models by splitting pitch and hand attributes based on their measured independence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a structural inductive bias to solve the missing middle problem in polyphonic music generation, using Beethoven piano sonatas as the test case. It first measures that pitch and hand attributes are nearly independent with normalized mutual information of 0.167, then builds the Smart Embedding layer that encodes them separately. Rigorous bounds show the split incurs at most 0.153 bits of information loss and yields a 28.09 percent tighter generalization bound via Rademacher complexity. In practice the architecture cuts parameters nearly in half and lowers validation loss by 9.47 percent, with SVD checks and a 53-person listening study confirming retained musical quality.

Core claim

The Smart Embedding architecture separates pitch and hand embeddings, injecting a domain-specific inductive bias that exploits the low normalized mutual information (0.167) between these attributes. The paper reports a 48.30 percent parameter reduction, information loss bounded at 0.153 bits, and a 28.09 percent tighter Rademacher generalization bound, supported empirically by a 9.47 percent drop in validation loss on Beethoven sonata data.
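The independence check behind this claim can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's code: it estimates mutual information from (pitch, hand) token pairs and normalizes by the mean of the two marginal entropies. The normalization choice, the function names, and the toy data are all assumptions; the paper's exact NMI definition is not given on this page.

```python
import math
from collections import Counter


def entropy_bits(counts, total):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def mutual_information_bits(pairs):
    """Return (I(X;Y), H(X), H(Y)) in bits, estimated from (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi, entropy_bits(px.values(), n), entropy_bits(py.values(), n)


def nmi(pairs):
    """NMI with arithmetic-mean normalization (assumed, not confirmed by the source)."""
    mi, hx, hy = mutual_information_bits(pairs)
    denom = 0.5 * (hx + hy)
    return mi / denom if denom > 0 else 0.0


# Hypothetical toy data: (MIDI pitch, hand) tokens, not drawn from the Beethoven corpus.
independent = [(p, h) for p in (60, 64, 67, 72) for h in ("L", "R")]
dependent = [(60, "L"), (64, "L"), (67, "R"), (72, "R")]

print(nmi(independent))  # every pitch occurs equally in both hands
print(nmi(dependent))    # the hand is fully determined by the pitch
```

A value near zero, as in the first toy case, is the regime in which a split embedding discards little joint structure; the second case shows how the score rises when one attribute determines the other.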

What carries the argument

The Smart Embedding architecture, which structurally decouples pitch and hand attribute embeddings to enforce independence-based inductive bias.
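To make the decoupling concrete, here is a minimal sketch that compares a joint (pitch, hand) embedding table with two separate tables. It is an illustration under stated assumptions, not the paper's implementation: the vocabulary sizes, the embedding width, the class and function names, and the rule of summing the two vectors are all hypothetical.

```python
import random


def joint_embedding_params(n_pitch, n_hand, dim):
    """Joint table: one row per (pitch, hand) combination."""
    return n_pitch * n_hand * dim


def factored_embedding_params(n_pitch, n_hand, dim):
    """Factored tables: one row per pitch plus one row per hand."""
    return (n_pitch + n_hand) * dim


class FactoredEmbedding:
    """Toy lookup: embedding(pitch, hand) = pitch_table[pitch] + hand_table[hand].

    Summation is an assumed combination rule chosen for illustration.
    """

    def __init__(self, n_pitch, n_hand, dim, seed=0):
        rng = random.Random(seed)
        self.pitch_table = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(n_pitch)]
        self.hand_table = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(n_hand)]

    def __call__(self, pitch, hand):
        return [p + h for p, h in zip(self.pitch_table[pitch], self.hand_table[hand])]


# Hypothetical sizes: 88 piano keys, 2 hands, 64-dimensional vectors.
n_pitch, n_hand, dim = 88, 2, 64
joint = joint_embedding_params(n_pitch, n_hand, dim)
factored = factored_embedding_params(n_pitch, n_hand, dim)
print(joint, factored, 1 - factored / joint)

vector = FactoredEmbedding(n_pitch, n_hand, dim)(60, 1)  # pitch index 60, hand index 1
print(len(vector))
```

With these toy sizes the factored tables are roughly 48.9 percent smaller than the joint table. That number follows from the hypothetical sizes alone and should not be read as a derivation of the paper's 48.30 percent figure.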

If this is right

Model size shrinks by 48.30 percent while validation loss falls 9.47 percent on the target repertoire.
Generalization improves with a 28.09 percent tighter Rademacher bound and negligible information loss under 0.153 bits.
Category-theoretic arguments establish greater training stability for the separated embedding structure.
SVD analysis and listening tests with 53 experts confirm that generated polyphonic output retains quality.
The same separation principle supplies a template for injecting measurable structural bias in other sequence-generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same independence measurement could be applied to other instrument pairs or musical styles to test whether the parameter savings generalize.
The architecture might be adapted to separate other weakly correlated factors such as rhythm and dynamics in longer-form generation.
If the NMI remains low across corpora, the method offers a practical way to scale music models without proportional growth in parameters.
Combining the information-theoretic bound with the category-theoretic stability proof could guide similar splits in non-musical multimodal models.

Load-bearing premise

The assumption that pitch and hand attributes remain sufficiently independent outside the Beethoven corpus so that the structural split preserves musical coherence.

What would settle it

Running the model on a new music corpus where normalized mutual information between pitch and hand attributes exceeds 0.3 and measuring whether validation loss reduction disappears or expert listeners report loss of coherence.

Figures

Figures reproduced from arXiv: 2601.03612 by Joonwon Seo.

**Figure 1.1.** Conceptual diagram of the "Missing Middle." Current SOTA models excel at Global Form and Local Patterns but struggle with the intermediate level of coherent Phrases.

**Figure 5.1.** Comparison of Validation Loss. Smart ON demonstrates faster convergence and a significantly lower final loss (1.013) compared to the baseline (1.119).

**Figure 5.2.** Comparison of normalized singular value spectra. The Smart ON architecture (blue) maintains a stable, efficient distribution of information across dimensions, avoiding the sharp rank collapse and information loss observed in the baseline (gray dashed). This enables higher effective rank with fewer parameters.

**Figure 7.1.** Comparison of Validation Loss. Smart ON demonstrates faster convergence and a significantly lower final loss (1.013) compared to the baseline (1.119).

**Figure 9.1.** Discovery of the SVD Paradox. Ablation study comparing validation losses across architectural variants. Note how Smart v2 (Wide) significantly outperforms the Baseline (Dense) despite parameter parity, while increasing depth alone leads to optimization collapse.

**Figure 9.2.** Topology Impact on Loss (Left) and Optimized Topology Heatmap (Right).

**Figure 10.1.** The "Kill Shot" Experiment. Performance comparison on synthetic data across increasing dimensions. At the critical threshold of d = 1024, the Dense model (red) undergoes a phase transition and collapses, while the Smart Embedding (blue) remains robust, validating the RPTP theoretical guarantee.
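The Figure 5.2 caption refers to normalized singular value spectra and to "effective rank". As a rough sketch of how such quantities can be computed from a list of singular values, the snippet below scales a spectrum by its largest value and uses a common entropy-based effective rank. Both the definition and the sample spectra are assumptions for illustration; the page does not state which definition the paper uses.

```python
import math


def normalized_spectrum(singular_values):
    """Scale singular values so the largest equals 1."""
    top = max(singular_values)
    return [s / top for s in singular_values]


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    This definition is an assumption; the paper may use a different one.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical spectra, not measured from any model.
flat = [1.0, 1.0, 1.0, 1.0]             # information spread evenly
collapsed = [1.0, 1e-6, 1e-6, 1e-6]     # one dominant direction

print(normalized_spectrum([4.0, 2.0, 1.0]))
print(effective_rank(flat))
print(effective_rank(collapsed))
```

A flat spectrum gives an effective rank equal to the number of dimensions, while a collapsed spectrum gives a value close to 1, which is the contrast the caption describes between the two architectures.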

Original abstract

This monograph introduces a novel approach to polyphonic music generation by addressing the "Missing Middle" problem through structural inductive bias. Focusing on Beethoven's piano sonatas as a case study, we empirically verify the independence of pitch and hand attributes using normalized mutual information (NMI=0.167) and propose the Smart Embedding architecture, achieving a 48.30% reduction in parameters. We provide rigorous mathematical proofs using information theory (negligible loss bounded at 0.153 bits), Rademacher complexity (28.09% tighter generalization bound), and category theory to demonstrate improved stability and generalization. Empirical results show a 9.47% reduction in validation loss, confirmed by SVD analysis and an expert listening study (N=53). This dual theoretical and applied framework bridges gaps in AI music generation, offering verifiable insights for mathematically grounded deep learning.
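The 9.47% figure in the abstract is consistent with the final validation losses quoted in the Figure 5.1 caption (1.013 for Smart ON, 1.119 for the baseline), if it is read as a relative reduction. The check below is a small sketch of that arithmetic; the function name is illustrative.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


baseline_loss = 1.119  # baseline final validation loss (Figure 5.1 caption)
smart_loss = 1.013     # Smart ON final validation loss (Figure 5.1 caption)

print(f"{100 * relative_reduction(baseline_loss, smart_loss):.2f}%")  # prints 9.47%
```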

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Smart Embedding claims a 48% parameter cut for polyphonic music via low NMI on Beethoven data plus info-theory and Rademacher bounds, but the proofs and broader tests are missing.

The paper's core move is to use low normalized mutual information between pitch and hand attributes in Beethoven piano music to justify splitting the embedding space, which they say cuts parameters by nearly half with almost no information loss. What stands out as new is applying this structural bias specifically to polyphonic generation, along with combining information-theoretic bounds, Rademacher complexity arguments, and category theory for stability. The empirical side shows a validation loss drop and a small listening test. It does a decent job framing the missing middle problem and trying to make the inductive bias explicit rather than just hoping the model learns it.

The soft spots are more concerning. The NMI of 0.167 is only reported for the Beethoven corpus, so if other styles like Bach or Chopin have higher dependence, the split could lose important joint structure that the bounds don't cover. The listening study with 53 participants is small, and without error bars or full baseline tables it's tough to judge the 9.47% improvement. Most critically, the abstract mentions rigorous proofs but doesn't show the derivation steps, so I can't tell if the 0.153-bit bound or the 28% tighter generalization is tight or relies on hidden assumptions.

This work is aimed at people building generative models for structured sequences who want to inject domain knowledge via embeddings. A reader interested in parameter-efficient music AI could pick up the Smart Embedding idea and the independence check, but they'd need to re-derive the math themselves to be sure.

I'd recommend sending it to peer review. The idea is worth checking, but the authors will need to expand the proofs and test the split on more repertoires before it lands.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Smart Embedding architecture for polyphonic music generation to address the 'Missing Middle' problem via structural inductive bias. Using Beethoven's piano sonatas as a case study, it verifies low dependence between pitch and hand attributes (NMI=0.167), proposes Smart Embedding for a claimed 48.30% parameter reduction with 0.153-bit information loss, supplies proofs via information theory, Rademacher complexity (28.09% tighter generalization bound), and category theory for stability, and reports 9.47% validation loss reduction, SVD analysis, and an expert listening study (N=53).

Significance. If the independence generalizes and the bounds are non-circular, the work supplies a mathematically grounded inductive bias that could reduce model size while preserving coherence in music generation tasks. The explicit use of information-theoretic loss bounds, Rademacher analysis, and category-theoretic stability arguments, together with empirical validation and a listening study, strengthens the case for theory-driven architecture design in creative sequence modeling.

major comments (2)

[Abstract] The justification for the structural split in Smart Embedding rests on NMI=0.167 between pitch and hand attributes, reported only for the Beethoven sonata corpus and treated as licensing a general inductive bias. If dependence is materially higher in other polyphonic repertoires, the claimed 0.153-bit negligible loss and the Rademacher bound tightness are no longer guaranteed, directly undermining the central parameter-reduction and generalization claims.
[Abstract] No derivation steps, explicit baseline models, or error bars are supplied for the 48.30% parameter reduction, 9.47% validation-loss improvement, or the 28.09% tighter Rademacher bound. Without these, the soundness of the information-theoretic and complexity arguments cannot be verified and the numerical claims remain unassessable.

minor comments (1)

[Abstract] The listening study (N=53) is small; reporting confidence intervals or statistical power for the expert evaluations would strengthen the empirical section.
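To show what such an interval looks like at this sample size, here is a minimal sketch of a Wilson score interval for a proportion with n = 53. The count of 35 is a made-up placeholder, since this page does not report the listening study's raw outcomes, and treating the outcome as a single binary preference is itself an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical outcome: 35 of 53 listeners prefer the Smart Embedding output.
low, high = wilson_interval(35, 53)
print(f"{low:.3f} to {high:.3f}")
```

At N=53 the interval spans roughly a quarter of the unit range, which illustrates why the referee asks for intervals rather than point estimates.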

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and verifiability while preserving the core contributions.

Referee: [Abstract] The justification for the structural split in Smart Embedding rests on NMI=0.167 between pitch and hand attributes, reported only for the Beethoven sonata corpus and treated as licensing a general inductive bias. If dependence is materially higher in other polyphonic repertoires, the claimed 0.153-bit negligible loss and the Rademacher bound tightness are no longer guaranteed, directly undermining the central parameter-reduction and generalization claims.

Authors: The manuscript presents Beethoven's piano sonatas explicitly as a case study to empirically verify the low dependence (NMI=0.167) between pitch and hand attributes, which motivates the Smart Embedding architecture. This structural bias is grounded in the physical separation of hands in piano performance rather than claimed as universal. We will revise the abstract and introduction to emphasize that all quantitative claims (including the 0.153-bit bound derived from NMI via information-theoretic inequalities and the Rademacher tightness) are conditional on the observed independence in this repertoire, and we will add a discussion of potential generalization to other polyphonic styles along with the underlying musical rationale.
revision: partial

Referee: [Abstract] No derivation steps, explicit baseline models, or error bars are supplied for the 48.30% parameter reduction, 9.47% validation-loss improvement, or the 28.09% tighter Rademacher bound. Without these, the soundness of the information-theoretic and complexity arguments cannot be verified and the numerical claims remain unassessable.

Authors: We agree that the abstract and main text lack sufficient detail for verification. In the revision we will (i) provide the explicit derivation of the 48.30% parameter reduction by comparing the embedding parameter counts of the standard baseline (full joint embedding over pitch+hand) versus the factored Smart Embedding, (ii) name the baseline models (standard Transformer with joint embeddings), and (iii) report error bars or standard deviations for the 9.47% validation-loss reduction and the 28.09% Rademacher-bound improvement, computed over multiple random seeds. Intermediate steps for the Rademacher analysis (Theorem 2) will be expanded in the main text or appendix.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper measures NMI=0.167 empirically on the Beethoven corpus to support the pitch/hand split, then applies standard information-theoretic bounds (0.153-bit loss), Rademacher complexity arguments, and category theory to the resulting Smart Embedding architecture. These steps do not reduce by construction to self-definition, fitted parameters renamed as predictions, or self-citation chains; the reported parameter reduction and validation-loss improvement follow from the architecture choice rather than tautological re-derivation of inputs. The NMI measurement is external data, not an internal fit, and the mathematical claims invoke general theorems rather than importing uniqueness results from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the measured independence of pitch and hand attributes plus the effectiveness of the newly introduced Smart Embedding layer; no explicit free parameters are stated, but the architecture itself is an invented modeling choice.

axioms (1)

Domain assumption: pitch and hand attributes are independent enough to be separated without musical loss.
Supported only by NMI=0.167 on the Beethoven corpus; treated as generalizable.

invented entities (1)

Smart Embedding (no independent evidence)
Purpose: to embed the structural independence of pitch and hand into the model for parameter reduction.
New architecture introduced in the paper; no external evidence of its necessity is provided beyond the reported gains.

pith-pipeline@v0.9.0 · 5441 in / 1332 out tokens · 35567 ms · 2026-05-16T17:11:28.331718+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We empirically verify the independence of pitch and hand attributes using normalized mutual information (NMI=0.167) and propose the Smart Embedding architecture... rigorous mathematical proofs using information theory (negligible loss bounded at 0.153 bits), Rademacher complexity... and category theory"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Smart Embedding as a Structure-Preserving Map... formalized using Category Theory"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 14 internal anchors

[1]

Huron,Sweet anticipation: Music and the psychology of expectation

D. Huron,Sweet anticipation: Music and the psychology of expectation. MIT Press, 2006

work page 2006
[2]

C. L. Krumhansl,Cognitive foundations of musical pitch. Oxford University Press, 1990

work page 1990
[3]

Narmour,The analysis and cognition of basic melodic structures: The implication-realization model

E. Narmour,The analysis and cognition of basic melodic structures: The implication-realization model. University of Chicago Press, 1990

work page 1990
[4]

Musical composition with a high-speed digital computer,

L. A. Hiller and L. M. Isaacson, “Musical composition with a high-speed digital computer,” Journal of the Audio Engineering Society, vol. 6, no. 3, pp. 154–160, 1958

work page 1958
[5]

Xenakis,Formalized music: Thought and mathematics in composition

I. Xenakis,Formalized music: Thought and mathematics in composition. Pendragon Press, 1992

work page 1992
[6]

Deep learning,

Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, no. 7553, pp. 436–444, 2015

work page 2015
[7]

Goodfellow, Y

I. Goodfellow, Y. Bengio, and A. Courville,Deep learning. MIT Press, 2016

work page 2016
[8]

Representation learning: A review and new perspectives,

Y. Bengio, A. Courville, and P . Vincent, “Representation learning: A review and new perspectives,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013

work page 2013
[9]

Lerdahl and R

F. Lerdahl and R. Jackendoff,A generative theory of tonal music. MIT Press, 1983

work page 1983
[10]

Temperley,Music and probability

D. Temperley,Music and probability. MIT press, 2007

work page 2007
[11]

L. B. Meyer,Emotion and meaning in music. University of Chicago Press, 1956

work page 1956
[12]

Cope,Computers and musical style

D. Cope,Computers and musical style. A-R Editions, Inc., 1991

work page 1991
[13]

An expert system for harmonization of chorales in the style of j.s. bach,

K. Ebcioglu, “An expert system for harmonization of chorales in the style of j.s. bach,” Journal of Logic Programming, vol. 8, no. 1-2, pp. 145–185, 1990

work page 1990
[14]

A hierarchical latent vector model for learning long-term structure in music,

A. Robertset al., “A hierarchical latent vector model for learning long-term structure in music,” inProceedings of the 35th International Conference on Machine Learning (ICML), pp. 4364–4373, 2018

work page 2018
[15]

Auto-Encoding Variational Bayes

D. P . Kingma and M. Welling, “Auto-encoding variational bayes,”arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[16]

Generative adversarial nets,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” inAdvances in Neural Information Processing Systems (NeurIPS), 2014. 80 BIBLIOGRAPHY81

work page 2014
[17]

Attention is all you need,

A. Vaswaniet al., “Attention is all you need,” inAdvances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008, 2017

work page 2017
[18]

Music Transformer

C.-Z. A. Huanget al., “Music transformer: Generating music with long-term structure,” arXiv preprint arXiv:1809.04281, 2018. Published at ICLR 2019

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,”arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,”OpenAI Blog, vol. 1, no. 8, p. 9, 2019

work page 2019
[21]

Language mod- els are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P . Dhariwal,et al., “Language mod- els are few-shot learners,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020

work page 1901
[22]

Whole-song hierarchical generation of symbolic music using cascaded diffusion models,

Z. Wang, L. Min, and G. Xia, “Whole-song hierarchical generation of symbolic music using cascaded diffusion models,”arXiv preprint arXiv:2405.09901, 2024

work page arXiv 2024
[23]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P . Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6840–6851, 2020

work page 2020
[24]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P . Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, 2022

work page 2022
[25]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Rameshet al., “Hierarchical text-conditional image generation with clip latents,”arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

This time with feeling: Learning expressive musical performance,

S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan, “This time with feeling: Learning expressive musical performance,”Neural Computing and Applications, vol. 32, pp. 955–967, 2020

work page 2020
[27]

Virtuosonet: A hierarchical rnn-based system for modeling expressive piano performance,

D. Jeonget al., “Virtuosonet: A hierarchical rnn-based system for modeling expressive piano performance,” inProceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pp. 129–136, 2019

work page 2019
[28]

W. E. Caplin,Classical form: A theory of formal functions for the instrumental music of Haydn, Mozart, and Beethoven. Oxford University Press, 1998

work page 1998
[29]

Rosen,The classical style: Haydn, Mozart, Beethoven

C. Rosen,The classical style: Haydn, Mozart, Beethoven. WW Norton & Company, 1997

work page 1997
[30]

R. O. Gjerdingen,Music in the galant style. Oxford University Press, 2007

work page 2007
[31]

Schenker,Free composition (Der freie Satz)

H. Schenker,Free composition (Der freie Satz). Pendragon Press, 1979

work page 1979
[32]

Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,

H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018

work page 2018
[33]

MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation

L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, “Midinet: A convolutional generative adversarial network for symbolic-domain music generation,”arXiv preprint arXiv:1703.10847, 2017. 82BIBLIOGRAPHY

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Jukebox: A Generative Model for Music

P . Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,”arXiv preprint arXiv:2005.00341, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[35]

Musenet,

C. Payne, “Musenet,”OpenAI Blog, 2019

work page 2019
[36]

Evaluation of creativity in automatic music generation systems,

K. Agres, D. Herremans, and G. Wiggins, “Evaluation of creativity in automatic music generation systems,” inMusical Metacreation, 2016

work page 2016
[37]

Computational models of expressive music performance: The state of the art,

G. Widmer and W. Goebl, “Computational models of expressive music performance: The state of the art,”Journal of New Music Research, vol. 33, no. 3, pp. 203–216, 2004

work page 2004
[38]

D. J. Levitin,This is your brain on music: The science of a human obsession. Dutton, 2006

work page 2006
[39]

Methods for music perception and cognition research,

T. Eerola and J. K. Vuoskoski, “Methods for music perception and cognition research,”The Oxford Handbook of Music Psychology, pp. 117–132, 2013

work page 2013
[40]

arXiv preprint arXiv:1709.01620 (2017)

J.-P . Briot, G. Hadjeres, and F. Pachet, “Deep learning techniques for music generation - a survey,”arXiv preprint arXiv:1709.01620, 2017

work page arXiv 2017
[41]

Deep learning for music generation: History and ongoing challenges,

J.-P . Briot, G. Hadjeres, and F.-D. Pachet, “Deep learning for music generation: History and ongoing challenges,”Neural Computing and Applications, vol. 32, no. 4, pp. 981–1005, 2020

work page 2020
[42]

Evaluation of computational music generation: A review,

L.-C. Yang and A. Lerch, “Evaluation of computational music generation: A review,”ACM Computing Surveys (CSUR), vol. 53, no. 1, pp. 1–37, 2020

work page 2020
[43]

Computing machinery and intelligence,

A. M. Turing, “Computing machinery and intelligence,”Mind, vol. 59, no. 236, pp. 433–460, 1950

work page 1950
[44]

Notes by the translator,

A. A. Lovelace, “Notes by the translator,”Scientific Memoirs, vol. 3, pp. 666–731, 1843

work page
[45]

Cope,Virtual music: Computer synthesis of musical style

D. Cope,Virtual music: Computer synthesis of musical style. MIT Press, 2001

work page 2001
[46]

J. J. Fux,Gradus ad Parnassum. Johann Peter van Ghelen, 1725

work page
[47]

Schoenberg,Theory of harmony

A. Schoenberg,Theory of harmony. Univ of California Press, 1978

work page 1978
[48]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997
[49]

A first look at a new approach to connectivity and memory in recurrent networks,

D. Eck and J. Schmidhuber, “A first look at a new approach to connectivity and memory in recurrent networks,” inProceedings of the 8th Conference on Intelligent Autonomous Systems, 2002

work page 2002
[50]

Learning representations by back- propagating errors,

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back- propagating errors,”nature, vol. 323, no. 6088, pp. 533–536, 1986

work page 1986
[51]

Gradient-based learning applied to docu- ment recognition,

Y. LeCun, L. Bottou, Y. Bengio, and P . Haffner, “Gradient-based learning applied to docu- ment recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998

work page 1998
[52]

Sequence to sequence learning with neural net- works,

I. Sutskever, O. Vinyals, and Q. V . Le, “Sequence to sequence learning with neural net- works,” inAdvances in Neural Information Processing Systems (NeurIPS), pp. 3104–3112, 2014. BIBLIOGRAPHY83

work page 2014
[53]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,”arXiv preprint arXiv:1406.1078, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[54]

Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,”arXiv preprint arXiv:1409.0473, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[55]

Distributed representations of words and phrases and their compositionality,

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” inAdvances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119, 2013

work page 2013
[56]

Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and tran- scription,

N. Boulanger-Lewandowski, Y. Bengio, and P . Vincent, “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and tran- scription,” inProceedings of the 29th International Conference on Machine Learning (ICML), pp. 1159–1166, 2012

work page 2012
[57]

A predictive model for music composition based on the expectation-maximization algorithm,

S. Lattner, M. Grachten, and G. Widmer, “A predictive model for music composition based on the expectation-maximization algorithm,” inProceedings of the 9th International Conference on Computational Creativity (ICCC), 2018

work page 2018
[58]

Self-attention with relative position representations,

P . Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” inProceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 464–468, 2018

work page 2018
[59]

Transformer-xl: Attentive language models beyond a fixed-length context,

Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019

work page 2019
[60]

N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient transformer,” in International Conference on Learning Representations (ICLR), 2020.
[61]

R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” 2019.
[62]

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in International Conference on Machine Learning (ICML), pp. 5156–5165, 2020.
[63]

K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, and L. Kaiser, “Rethinking attention with performers,” in International Conference on Learning Representations (ICLR), 2020.
[64]

J. Su et al., “RoFormer: Enhanced transformer with rotary position embedding,” arXiv preprint arXiv:2104.09864, 2021.
[65]

O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention with linear biases enables input length extrapolation,” in International Conference on Learning Representations (ICLR), 2022.
[66]

Y.-S. Huang and Y.-H. Yang, “Pop music transformer: Beat-based modeling and generation of expressive piano performances,” in Proceedings of the 28th ACM International Conference on Multimedia, pp. 1198–1206, 2020.
[67]

M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.
[68]

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
[69]

Y. Ren, J. He, X. Tan, T. Qin, Z. Zhao, and T.-Y. Liu, “PopMAG: Pop music accompaniment generation,” in Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), pp. 1198–1206, 2020.
[70]

G. Brunner, A. Konrad, Y. Wang, and R. Wattenhofer, “MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer,” arXiv preprint arXiv:1809.07600, 2018.
[71]

T. Nakamura, M. Y. H. Ikeda, and K. Yoshii, “Piano-tree VAE: Structured representation learning for polyphonic music,” in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), pp. 694–701, 2020.
[72]

W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang, “Compound word transformer: Learning to compose full-song music over dynamic directed hypergraphs,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 178–186, 2021.
[73]

S.-L. Wu and Y.-H. Yang, “MuseMorphose: Full-song and fine-grained piano music style transfer with one transformer VAE,” in Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), 2021.
[74]

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[75]

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[76]

J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[77]

D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
[78]

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[79]

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019.
[80]

T. M. Mitchell, Machine learning. McGraw-Hill, 1997.


[1]

D. Huron, Sweet anticipation: Music and the psychology of expectation. MIT Press, 2006.

[2]

C. L. Krumhansl, Cognitive foundations of musical pitch. Oxford University Press, 1990.

[3]

E. Narmour, The analysis and cognition of basic melodic structures: The implication-realization model. University of Chicago Press, 1990.

[4]

L. A. Hiller and L. M. Isaacson, “Musical composition with a high-speed digital computer,” Journal of the Audio Engineering Society, vol. 6, no. 3, pp. 154–160, 1958.

[5]

I. Xenakis, Formalized music: Thought and mathematics in composition. Pendragon Press, 1992.

[6]

Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[7]

I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.

[8]

Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.

[9]

F. Lerdahl and R. Jackendoff, A generative theory of tonal music. MIT Press, 1983.

[10]

D. Temperley, Music and probability. MIT Press, 2007.

[11]

L. B. Meyer, Emotion and meaning in music. University of Chicago Press, 1956.

[12]

D. Cope, Computers and musical style. A-R Editions, Inc., 1991.

[13]

K. Ebcioglu, “An expert system for harmonization of chorales in the style of J.S. Bach,” Journal of Logic Programming, vol. 8, no. 1-2, pp. 145–185, 1990.

[14]

A. Roberts et al., “A hierarchical latent vector model for learning long-term structure in music,” in Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 4364–4373, 2018.

[15]

D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.

[16]

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NeurIPS), 2014.

[17]

A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008, 2017.

[18]

C.-Z. A. Huang et al., “Music transformer: Generating music with long-term structure,” arXiv preprint arXiv:1809.04281, 2018. Published at ICLR 2019.

[19]

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

[20]

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.

[21]

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.

[22]

Z. Wang, L. Min, and G. Xia, “Whole-song hierarchical generation of symbolic music using cascaded diffusion models,” arXiv preprint arXiv:2405.09901, 2024.

[23]

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6840–6851, 2020.

[24]

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, 2022.

[25]

A. Ramesh et al., “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv:2204.06125, 2022.

[26]

S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan, “This time with feeling: Learning expressive musical performance,” Neural Computing and Applications, vol. 32, pp. 955–967, 2020.

[27]

D. Jeong et al., “VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance,” in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), pp. 129–136, 2019.

[28]

W. E. Caplin, Classical form: A theory of formal functions for the instrumental music of Haydn, Mozart, and Beethoven. Oxford University Press, 1998.

[29]

C. Rosen, The classical style: Haydn, Mozart, Beethoven. WW Norton & Company, 1997.

[30]

R. O. Gjerdingen, Music in the galant style. Oxford University Press, 2007.

[31]

H. Schenker, Free composition (Der freie Satz). Pendragon Press, 1979.

[32]

H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.

[33]

L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, “MidiNet: A convolutional generative adversarial network for symbolic-domain music generation,” arXiv preprint arXiv:1703.10847, 2017.

[34]

P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.

[35]

C. Payne, “MuseNet,” OpenAI Blog, 2019.

[36]

K. Agres, D. Herremans, and G. Wiggins, “Evaluation of creativity in automatic music generation systems,” in Musical Metacreation, 2016.

[37]

G. Widmer and W. Goebl, “Computational models of expressive music performance: The state of the art,” Journal of New Music Research, vol. 33, no. 3, pp. 203–216, 2004.

[38]

D. J. Levitin, This is your brain on music: The science of a human obsession. Dutton, 2006.

[39]

T. Eerola and J. K. Vuoskoski, “Methods for music perception and cognition research,” The Oxford Handbook of Music Psychology, pp. 117–132, 2013.

[40]

J.-P. Briot, G. Hadjeres, and F. Pachet, “Deep learning techniques for music generation - a survey,” arXiv preprint arXiv:1709.01620, 2017.

[41]

J.-P. Briot, G. Hadjeres, and F.-D. Pachet, “Deep learning for music generation: History and ongoing challenges,” Neural Computing and Applications, vol. 32, no. 4, pp. 981–1005, 2020.

[42]

L.-C. Yang and A. Lerch, “Evaluation of computational music generation: A review,” ACM Computing Surveys (CSUR), vol. 53, no. 1, pp. 1–37, 2020.

[43]

A. M. Turing, “Computing machinery and intelligence,” Mind, vol. 59, no. 236, pp. 433–460, 1950.

[44]

A. A. Lovelace, “Notes by the translator,” Scientific Memoirs, vol. 3, pp. 666–731, 1843.

[45]

D. Cope, Virtual music: Computer synthesis of musical style. MIT Press, 2001.

[46]

J. J. Fux, Gradus ad Parnassum. Johann Peter van Ghelen, 1725.

[47]

A. Schoenberg, Theory of harmony. University of California Press, 1978.

[48]

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[49]

D. Eck and J. Schmidhuber, “A first look at a new approach to connectivity and memory in recurrent networks,” in Proceedings of the 8th Conference on Intelligent Autonomous Systems, 2002.

[50]

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[51]

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[52]

I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 3104–3112, 2014.

[53]

K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
