Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

Hideo Mukai; Shinnosuke Taksuka

arxiv: 2605.21081 · v1 · pith:PUKFMZWXnew · submitted 2026-05-20 · 💻 cs.SD · cs.LG

Pith reviewed 2026-05-21 01:40 UTC · model grok-4.3

keywords: music generation · transformer model · attention mechanism · musical metadata · AI composition · melody coherence

The pith

Incorporating musical metadata into the attention mechanism lets Transformers generate more coherent and less repetitive music.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new attention mechanism called Musical Attention for Transformer models in music generation. It integrates meta-information such as bar numbers, key, time signatures, and tempos along with note features like pitch, onset, duration, and velocity. By explicitly modeling correlations among these eight elements in the attention process, the model better captures musical structure. Experiments show this approach reduces excessive repetition and improves overall quality compared to standard full or strided attention methods. This matters because current AI music often sounds unnatural due to duplicated notes and lack of harmonic consistency.

Core claim

The central claim is that modifying the Transformer's attention mechanism to reflect correlations among eight musical features—pitch, bar number, onset, duration, velocity, key, signature, and tempo—enables more effective capture of musical characteristics, leading to generated compositions with greater coherence, variation, and reduced repetition.
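As a rough illustration of the eight-feature representation named in the core claim, the sketch below stores one note as a record and flattens a list of notes into a token stream. The field names follow the features listed above; the integer encodings, the tick unit, and the choice to emit one token per feature are assumptions made for illustration and are not confirmed by the paper.

```python
from dataclasses import dataclass, astuple

# Field names follow the eight features listed in the core claim. Value
# ranges and index encodings are assumptions made for illustration only.
@dataclass(frozen=True)
class NoteEvent:
    pitch: int      # MIDI pitch, 0-127
    bar: int        # bar number, counted from 0 (assumed)
    onset: int      # position inside the bar, in ticks (assumed unit)
    duration: int   # note length, in ticks (assumed unit)
    velocity: int   # MIDI velocity, 0-127
    key: int        # hypothetical key index, e.g. 0 = C major
    signature: int  # hypothetical time-signature index, e.g. 0 = 4/4
    tempo: int      # tempo in beats per minute

FEATURES = ("pitch", "bar", "onset", "duration", "velocity",
            "key", "signature", "tempo")

def flatten(notes):
    """Turn a list of NoteEvent into one (feature_name, value) token stream."""
    tokens = []
    for note in notes:
        tokens.extend(zip(FEATURES, astuple(note)))
    return tokens

notes = [NoteEvent(60, 0, 0, 480, 90, 0, 0, 120),
         NoteEvent(64, 0, 480, 480, 85, 0, 0, 120)]
tokens = flatten(notes)
```

Each note contributes eight tokens in a fixed order, so the feature type of any token can be recovered from its position modulo eight. Whether the paper interleaves features this way or embeds them jointly per note is not stated on this page.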

What carries the argument

Musical Attention, a modified attention process that incorporates meta-information and reflects correlations among eight note and structure features to better model musical composition.
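A paper passage quoted further down this page describes two attention patterns: attending to a limited set of preceding tokens within the same musical context, and referencing specific tokens that share the same attribute. The sketch below builds a boolean attention mask from one possible reading of those two rules. Treating "same musical context" as "same bar", the `window` parameter, and the function name `musical_mask` are all assumptions; the paper's actual definition may differ.

```python
def musical_mask(kinds, bars, window=4):
    """Boolean causal mask: mask[i][j] is True if query i may attend to key j.

    kinds[i] is the feature type of token i and bars[i] is its bar number.
    Both rules below are guesses at the quoted patterns, not the paper's
    definition.
    """
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions j <= i
            # Pattern 1 (assumed): a short window of preceding tokens,
            # restricted to the same bar.
            local = (i - j) < window and bars[i] == bars[j]
            # Pattern 2 (assumed): any earlier token of the same feature type.
            same_attribute = kinds[i] == kinds[j]
            mask[i][j] = local or same_attribute
    return mask

kinds = ["pitch", "onset", "pitch", "onset", "pitch", "onset"]
bars = [0, 0, 0, 0, 1, 1]
mask = musical_mask(kinds, bars, window=2)
```

In this toy example the pitch token at position 4 can see the earlier pitch tokens at positions 0 and 2, but not the onset token at position 3, which lies in a different bar and has a different feature type. A full causal mask would allow all three.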

If this is right

The model generates melodies with enhanced musical coherence and variation.
Repetition and duplication of notes are significantly reduced.
Harmonically consistent and diverse melodies are produced more effectively.
Outperforms prior methods like Full Attention and Strided Attention in overall quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could extend to other sequential arts like choreography or storytelling by embedding structural metadata.
Future work might test if the same principle applies to non-music audio like speech with prosody metadata.
Integrating more metadata types could further improve long-term dependency handling in generative models.

Load-bearing premise

That explicitly reflecting correlations among the eight features inside the attention mechanism will produce measurably better musical output than standard attention without introducing new failure modes or requiring extensive hyperparameter retuning.

What would settle it

A direct comparison experiment where the Musical Attention model fails to show statistically significant improvements in coherence or repetition metrics over baseline Transformer attention on a standardized music generation benchmark.
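One way such a comparison could be tested for significance is a paired sign-flip permutation test on per-piece metric scores. The sketch below is a generic version of that test and is not taken from the paper; the function name and the score lists are hypothetical, and the numbers are invented placeholders rather than reported results.

```python
import random

def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-piece scores."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference
        # is arbitrary, so flip each one with probability 0.5.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Placeholder repetition scores (lower is better), NOT results from the paper.
baseline_scores = [0.40, 0.42, 0.38, 0.45, 0.41, 0.39, 0.44, 0.43]
candidate_scores = [0.30, 0.35, 0.29, 0.33, 0.31, 0.28, 0.36, 0.32]
p_value = paired_permutation_test(baseline_scores, candidate_scores)
```

Pairing matters here: both models would need to be scored on the same prompts or seeds so that each difference compares like with like.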

Figures

Figures reproduced from arXiv: 2605.21081 by Hideo Mukai, Shinnosuke Taksuka.

**Figure 2.** Model architecture of the Music Attention Transformer. The input ...

**Figure 3.** Overview of MIDI data preprocessing. First, MIDI files collected ...

**Figure 4.** Mechanism of Musical Attention. Two additional attention pat...

**Figure 5.** The training process of the Transformer model. MIDI data are ...

**Figure 6.** Examples of multi-track music generated using the Full Attention ...

**Figure 7.** Pitch heatmaps generated by the Full Attention and Musical At...

**Figure 8.** Learning curves of single-track music generation.

**Figure 9.** Learning curves of multi-track music generation.

**Figure 10.** Heatmaps during music generation (Musical Attention).

**Figure 11.** Examples of the C major scale and the C minor scale. A scale is ...

**Figure 12.** Example of relative keys: C major and A minor. Relative keys ...

**Figure 13.** Diatonic chords based on the C major scale. In the key of C ...

Original abstract

This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as excessive repetition or duplication of notes, leading to unnatural melodies. To address these limitations, we propose Musical Attention, a mechanism that incorporates meta-information such as bar numbers, key, signatures, and tempos into the attention process. Musical Attention explicitly leverages both the structural properties of music and its associated metadata, enabling the Transformer's attention mechanism to operate more effectively and thereby improving the quality of the generated output. In our framework, each musical note is represented as a combination of five events-pitch, bar number, onset, duration, and velocity in addition to the three metadata elements. The attention mechanism is then modified to reflect the correlations among these eight features, allowing the model to better capture the inherent characteristics of musical composition. Experimental results demonstrate that the model incorporating Musical Attention outperforms prior methods, such as Full Attention and Strided Attention, in terms of musical coherence, variation, and overall quality. Notably, it significantly reduces repetition and enhances the model's ability to generate diverse, harmonically consistent melodies. Musical Attention thus represents a meaningful advancement in AI-driven music generation, facilitating the creation of more natural and expressive compositions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper tweaks Transformer attention to correlate eight music features, including metadata, but the performance claims come with no numbers or controls to back them up.

The punchline: this work proposes a music-specific attention mechanism, Musical Attention, that builds correlations among eight features (pitch, bar number, onset, duration, velocity, key, signature, and tempo) directly into the Transformer's attention computation to improve music generation quality. The supporting experiments, however, are not described in enough detail in the material provided to confirm the improvements.

The paper takes a common problem in Transformer-generated music, repetition and a lack of harmonic consistency, and tries to fix it by making attention aware of musical metadata. This is presented as a way for the model to capture structural properties that standard attention might overlook. The paper frames the problem clearly and offers a straightforward modification to the attention process. Representing each note as five events plus three metadata elements, then modifying attention to reflect their correlations, is a clear, if incremental, step.

The soft spots are mainly in the evaluation. The abstract states that the model outperforms Full Attention and Strided Attention in coherence, variation, and overall quality while reducing repetition, but it supplies no quantitative metrics, no information on the datasets used, no ablation results, and no statistical analysis. This leaves the link between data and claims unverified. The stress-test concern that the baselines may not receive the metadata elements is valid on the evidence available: if the comparison is not apples-to-apples on inputs, the gains could come from the added features rather than from the attention change itself, which would weaken the central argument. Proper controls and results in the full manuscript would address this. As it stands, the claims are hard to assess.

This paper is for people in AI music generation who are looking at ways to incorporate domain knowledge into sequence models. A reader working on similar Transformer applications for structured creative tasks might find the specific attention tweak worth considering or adapting. I recommend sending it to peer review, but only after the authors provide the experimental details, metrics, and confirmation that the baselines match on feature inputs. The idea has enough potential to warrant a closer look from referees.

Referee Report

2 major / 1 minor

Summary. The paper proposes Musical Attention, a modification to the standard Transformer attention mechanism for music generation. It represents each note using five events (pitch, bar number, onset, duration, velocity) plus three metadata elements (key, signature, tempo), then modifies attention to explicitly reflect correlations among these eight features. The central claim is that this yields superior musical coherence, variation, and overall quality with significantly reduced repetition compared to Full Attention and Strided Attention baselines.

Significance. If the experimental claims hold under controlled conditions, the work offers a targeted way to inject musical structure and metadata directly into attention, which could help mitigate repetition and improve harmonic consistency in generated music. The idea of feature-aware attention is plausible and builds on known Transformer limitations in sequential data with strong hierarchical structure.

major comments (2)

Abstract and Experimental Results: The performance claims (outperformance on coherence, variation, reduced repetition) are stated without any quantitative metrics, dataset details, statistical tests, ablation studies, or hyperparameter information, preventing evaluation of whether the gains are attributable to the attention modification.
Abstract and Experimental Results: The comparison with Full Attention and Strided Attention does not state whether those baselines receive the identical eight-feature inputs (five note events plus key/signature/tempo metadata) or only the five note events; if the baselines omit the metadata, the reported improvements confound the effect of Musical Attention with the simple addition of richer structural inputs.
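The confound raised in the second major comment can be made concrete as a small factorial design that crosses every attention operator with every input representation. The sketch below only enumerates the conditions; the labels and helper names are hypothetical, and the page does not say whether Musical Attention can run without the metadata features, so that cell may be infeasible.

```python
from itertools import product

# Hypothetical condition labels, chosen for illustration.
ATTENTION = ("full", "strided", "musical")
INPUTS = ("five_note_events", "eight_features")

def ablation_grid():
    """Every attention operator crossed with every input representation."""
    return [{"attention": a, "inputs": i} for a, i in product(ATTENTION, INPUTS)]

def isolates_attention(run_a, run_b):
    """True when two runs differ only in the attention operator."""
    return (run_a["inputs"] == run_b["inputs"]
            and run_a["attention"] != run_b["attention"])

grid = ablation_grid()
# Pairs of runs that isolate the attention operator from the input change.
fair_pairs = [(a, b) for a in grid for b in grid
              if a["attention"] < b["attention"] and isolates_attention(a, b)]
```

Only pairs that hold the inputs fixed say anything about the attention change itself; a pair that changes both factors at once cannot separate the two effects.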

minor comments (1)

Abstract: The phrasing 'five events-pitch, bar number, onset, duration, and velocity in addition to the three metadata elements' is ambiguous because bar number appears in both lists; clarify the exact partitioning of the eight features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the experimental documentation and clarify the setup.

Referee: Abstract and Experimental Results: The performance claims (outperformance on coherence, variation, reduced repetition) are stated without any quantitative metrics, dataset details, statistical tests, ablation studies, or hyperparameter information, preventing evaluation of whether the gains are attributable to the attention modification.

Authors: We agree that the current presentation of results is insufficiently quantitative. In the revised manuscript we will expand the experimental section to report concrete metrics (e.g., repetition rate, harmonic consistency via chord-progression entropy, and diversity via n-gram coverage), specify the dataset and its preprocessing, include hyperparameter tables, and add ablation studies that isolate the contribution of Musical Attention from the metadata features. Statistical significance testing will also be reported where appropriate.

revision: yes

Referee: Abstract and Experimental Results: The comparison with Full Attention and Strided Attention does not state whether those baselines receive the identical eight-feature inputs (five note events plus key/signature/tempo metadata) or only the five note events; if the baselines omit the metadata, the reported improvements confound the effect of Musical Attention with the simple addition of richer structural inputs.

Authors: All models, including the Full Attention and Strided Attention baselines, were trained on the identical eight-feature representation (pitch, bar number, onset, duration, velocity plus key, signature, and tempo). The only difference is the attention operator itself. We will add an explicit statement to this effect in the abstract, methods, and experimental sections of the revised manuscript.

revision: yes
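The metrics named in the first response (repetition rate, n-gram coverage, chord-progression entropy) are not defined on this page. The sketch below shows one plausible definition of each, under the assumption that a piece is available as a list of pitches and a list of chord labels; the function names and the choice of n are hypothetical.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """All contiguous n-grams of seq, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def repetition_rate(seq, n=4):
    """Share of n-grams that already occurred earlier in the sequence."""
    grams = ngrams(seq, n)
    if not grams:
        return 0.0
    seen, repeats = set(), 0
    for gram in grams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(grams)

def distinct_ratio(seq, n=4):
    """Unique n-grams divided by total n-grams (one reading of 'coverage')."""
    grams = ngrams(seq, n)
    return len(set(grams)) / len(grams) if grams else 0.0

def entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

pitches = [60, 62, 64, 65] * 3            # a four-note loop played three times
rep = repetition_rate(pitches)            # 5 of the 9 four-grams are repeats
div = distinct_ratio(pitches)             # 4 of the 9 four-grams are unique
chord_entropy = entropy_bits(["C", "F", "G", "C"])
```

With these definitions the repetition rate and the distinct ratio always sum to one, so reporting both adds little; a revised manuscript would need to state which definitions it actually uses.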

Circularity Check

0 steps flagged

No circularity in model proposal or experimental claims

The paper proposes Musical Attention as an architectural modification that incorporates eight features (five note events plus three metadata elements) into the attention computation and reports empirical improvements over Full Attention and Strided Attention baselines. No equations, derivations, or self-citations are presented that reduce the claimed performance gains to a tautology or fitted input by construction. The central claim rests on experimental comparisons whose independence from the modeling choice is not contradicted by any self-referential step in the provided text; the mechanism is introduced as a novel ansatz rather than derived from prior results of the same authors. This is the most common honest finding for a modeling paper whose value is carried by its empirical section rather than by a closed mathematical chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level description of the attention modification; all implementation details remain unspecified.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean · period8 := 8 · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

each musical note is represented as a combination of five events—pitch, bar number, onset, duration, and velocity in addition to the three metadata elements... correlations among these eight features

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat orbit and 8-tick periodicity · unclear

Musical Attention... two main attention patterns: (1) attending to a limited set of preceding tokens within the same musical context, and (2) referencing specific tokens that share the same attribute

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 12 internal anchors

[1] Roberts A, Engel J, Raffel C, Simon I, Hawthorne C. MusicVAE: Creating a palette for musical scores with machine learning. arXiv preprint arXiv:1803.05428. 2018

[2] Brunner G, Konrad A, Wang Y, Wattenhofer R. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. arXiv preprint arXiv:1809.07600. 2018

[3] Brunner G, Wang Y, Wattenhofer R, Zhao S. Symbolic music genre transfer with CycleGAN. arXiv preprint arXiv:1809.07575. 2018

[4] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. arXiv preprint arXiv:1706.03762. 2017

[5] Zeng M, Tan X, Wang R, Ju Z, Qin T, Liu TY. MusicBERT: Symbolic music understanding with large-scale pre-training. arXiv preprint arXiv:2106.05630. 2021

[6] Chou YH, Chen IC, Chang CJ, Ching J, Yang YH. MidiBERT-Piano: Large-scale pre-training for symbolic music understanding. arXiv preprint arXiv:2107.05223. 2021

[7] Copet J, Kreuk F, Gat I, Remez T, Kant D, Synnaeve G, et al. Simple and controllable music generation. arXiv preprint arXiv:2306.05284. 2023

[8] Shaw P, Uszkoreit J, Vaswani A. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155. 2018

[9] Huang CA, Vaswani A, Uszkoreit J, Shazeer N, Simon I, Hawthorne C, et al. Music Transformer: Generating music with long-term structure. arXiv preprint arXiv:1809.04281. 2018

[10] Agostinelli A, Denk TI, Borsos Z, Engel J, Verzetti M, Caillon A, et al. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325. 2023

[11] Huang Q, Jansen A, Lee J, Ganti R, Li JY, Ellis DPW. MuLan: A joint embedding of music audio and natural language. arXiv preprint arXiv:2208.12415. 2022

[12] Yang LC, Chou SY, Yang YH. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847. 2017

[13] Dong HW, Hsiao WY, Yang LC, Yang YH. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. arXiv preprint arXiv:1709.06298. 2017

[14] Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. arXiv preprint arXiv:1406.2661. 2014

[15] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018

[16] Payne C. MuseNet. OpenAI. 2019 Apr 25. Available from: https://openai.com/blog/musenet

[17] Huang YS, Yang YH. Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions. arXiv preprint arXiv:2002.00212. 2020

[18] Wu SL, Yang YH. MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE. arXiv preprint arXiv:2105.04090. 2021

[19] Child R, Gray S, Radford A, Sutskever I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. 2019

[20] Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. 2020

[21] Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. 2022

[22] Borsos Z, Marinier R, Vincent D, Kharitonov E, Pietquin O, Sharifi M, et al. AudioLM: a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143. 2022

[23] Zeghidour N, Luebs A, Omran A, Skoglund J, Tagliasacchi M. SoundStream: An end-to-end neural audio codec. arXiv preprint arXiv:2107.03312. 2021

[24] Chung YA, Zhang Y, Han W, Chiu CC, Qi J, Pang R, et al. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. arXiv preprint arXiv:2108.06209. 2021

[25] Raffel C. Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching [PhD thesis]. Columbia University; 2016. Available from: https://colinraffel.com/projects/lmd/

[26] Yating Music. Miditoolkit: A Python package for working with MIDI files. 2021. Available from: https://github.com/YatingMusic/miditoolkit
