Classical Music Prediction and Composition by means of Variational Autoencoders

Alejandro Pazos; Daniel Rivero; Enrique Fernandez-Blanco

arXiv: 1906.09972 · v1 · pith:3YTGDZZGnew · submitted 2019-06-21 · cs.SD · cs.LG · cs.NE · eess.AS

Pith reviewed 2026-05-25 18:12 UTC · model grok-4.3

classification cs.SD · cs.LG · cs.NE · eess.AS

keywords variational autoencoders · music composition · music prediction · latent space · classical music · generative models · sequence modeling

The pith

A variational autoencoder can represent classical music in latent space for accurate predictions and composition of new pieces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that variational autoencoders trained on a small set of classical pieces learn to map music into a latent space. From this space the model can both reconstruct the original and forecast the next musical values. These forecasts work well enough on pieces not used in training that the model can generate new music by extending an input fragment or by beginning from a random latent point. A reader would care because this approach avoids the need for large datasets or explicit musical rules. If the claim holds, modest collections of music become sufficient to build predictive composition systems.

Core claim

The central claim is that variational autoencoders provide a way to embed musical sequences into a latent space that supports both representation and prediction of future values, allowing the generation of new classical music either by continuing an existing piece or from a random starting point, and that this works even with small training sets.

What carries the argument

Variational autoencoder trained to reconstruct and predict music sequences in latent space.

If this is right

Music composition is possible by predicting forward from an encoded existing piece.
New pieces can be started from a random point in the latent space.
The system generalizes to unseen classical pieces despite small training data.
Accurate latent representations enable both prediction and generation tasks.
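
The first two consequences above amount to a feedback loop: predict the next frame from the most recent context, append it, and repeat. The sketch below is a minimal illustration of that loop in Python, not the authors' implementation. The names `continue_piece` and `predict_next`, and the piano-roll layout (frames × pitches), are assumptions made for illustration, and `hold_last` is a trivial stand-in for a trained model.

```python
import numpy as np

def continue_piece(seed, predict_next, n_steps, window):
    """Extend a piano roll by feeding each predicted frame back in as context.

    seed         : array of shape (frames, pitches), the existing fragment
    predict_next : hypothetical callable mapping a context window to one frame
    n_steps      : number of new frames to generate
    window       : number of most recent frames passed as context
    """
    roll = np.asarray(seed, dtype=float)
    for _ in range(n_steps):
        context = roll[-window:]                     # most recent frames
        next_frame = np.asarray(predict_next(context))
        roll = np.vstack([roll, next_frame[None, :]])
    return roll

# Stand-in predictor (NOT a trained VAE): repeat the last frame of the context.
def hold_last(context):
    return context[-1]

seed = np.zeros((4, 8))
seed[-1, 3] = 1.0                                    # one active pitch in the last frame
out = continue_piece(seed, hold_last, n_steps=3, window=4)
print(out.shape)
```

Starting from a random latent point would replace the seed with a decoded sample. That step depends on the trained decoder and is not shown here.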

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar models could be tested on other musical genres or non-Western traditions.
The latent space might allow smooth transitions between different styles of music.
Extending the prediction horizon could reveal limits of the learned structure.

Load-bearing premise

The latent representations learned by the VAE on a small set of classical pieces capture enough musical structure to support accurate future-value predictions on unseen data.

What would settle it

Running the trained model on a new classical piece withheld from training and checking whether its prediction error is substantially lower than that of a naive baseline such as repeating the previous note.
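
The check described above can be sketched in a few lines. This is a hedged illustration only: it assumes binary piano-roll frames and uses the fraction of mismatched cells as the error measure, neither of which is confirmed by the paper. The function names and the toy arrays are hypothetical.

```python
import numpy as np

def frame_error(pred, target):
    """Fraction of piano-roll cells where prediction and target disagree."""
    return float(np.mean(np.asarray(pred) != np.asarray(target)))

def last_frame_baseline(roll):
    """Naive baseline: predict each frame as a copy of the frame before it."""
    return roll[:-1]

def compare_to_baseline(roll, model_pred):
    """roll: (frames, pitches) binary array; model_pred: predictions for frames 1..end."""
    target = roll[1:]
    return {
        "model": frame_error(model_pred, target),
        "baseline": frame_error(last_frame_baseline(roll), target),
    }

# Toy held-out piece and made-up model predictions (illustrative values only).
roll = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
model_pred = np.array([[1, 0], [0, 1], [0, 1]])
scores = compare_to_baseline(roll, model_pred)
print(scores)
```

A model that has learned useful temporal structure should score clearly below the baseline on pieces withheld from training. On real data the gap may be small, because consecutive frames of music are often identical.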

Figures

Figures reproduced from arXiv: 1906.09972 by Alejandro Pazos, Daniel Rivero, Enrique Fernandez-Blanco.

**Figure 2.** Sensitivity, VPP and F1-score for the different thresholds in the selected configuration.

As already stated, the system tries to predict 1 second of music from T − 1 seconds of music. Therefore, this value of T is an important parameter. Low values do not give enough information to predict the music. On the other hand, too large values may give too much information and make the training process too s…
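
The metrics named in the Figure 2 caption can be computed for a given threshold as sketched below. This assumes that "VPP" denotes positive predictive value (precision) and that the model emits per-note activation probabilities that are binarized at a threshold. Both are assumptions, and the arrays are toy values rather than results from the paper.

```python
import numpy as np

def threshold_metrics(probs, truth, threshold):
    """Sensitivity, positive predictive value and F1 after thresholding."""
    pred = np.asarray(probs) >= threshold
    truth = np.asarray(truth).astype(bool)
    tp = np.sum(pred & truth)        # active notes correctly predicted
    fp = np.sum(pred & ~truth)       # notes predicted but not present
    fn = np.sum(~pred & truth)       # notes present but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    total = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / total if total else 0.0
    return float(sensitivity), float(ppv), float(f1)

# Toy activations and ground truth (illustrative values only).
probs = np.array([0.9, 0.6, 0.4, 0.1])
truth = np.array([1, 1, 0, 0])
for t in (0.3, 0.5, 0.7):
    print(t, threshold_metrics(probs, truth, t))
```

Lowering the threshold tends to raise sensitivity at the cost of positive predictive value, which is why a sweep over thresholds is informative.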

**Figure 3.** Comparison of the results in the reconstruction and prediction problems

**Figure 4.** Results in prediction on different moments

read the original abstract

This paper proposes a new model for music prediction based on Variational Autoencoders (VAEs). In this work, VAEs are used in a novel way in order to address two different problems: music representation into the latent space, and using this representation to make predictions of the future values of the musical piece. This approach was trained with different songs of a classical composer. As a result, the system can represent the music in the latent space, and make accurate predictions. Therefore, the system can be used to compose new music either from an existing piece or from a random starting point. An additional feature of this system is that a small dataset was used for training. However, results show that the system is able to return accurate representations and predictions in unseen data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Standard VAE applied to small classical music dataset claims accurate predictions but supplies zero metrics or baselines to check them.

The main takeaway is that this paper trains a plain VAE on a handful of pieces from one classical composer, maps the sequences into latent space, and asserts that the model then makes accurate future-value predictions on unseen data, enabling composition from existing pieces or random starts. The small training set is called out as a positive feature. That is basically the whole contribution.

By 2019 the use of VAEs for sequential data like this was already in the literature, so the work is an application rather than a new technique. It does at least demonstrate that the standard VAE objective can be run on limited music data without obvious implementation problems. The small-data angle is worth noting for anyone who has to work with modest corpora.

The central weakness is the complete lack of numbers. The abstract states that representations and predictions are accurate on unseen data, yet there are no test losses, no note-level accuracies, no comparison against even a last-note-hold baseline, and no description of how the prediction step is actually performed or validated. Without those details it is impossible to know whether the latent space is capturing any useful temporal structure or whether the outputs are just plausible by construction. The stress-test note correctly identifies this gap.

The paper is therefore mainly of interest to someone looking for a minimal working example of VAE music modeling. It is not the sort of thing I would bring to a reading group, and I would not cite it. It does not look ready for peer review until the authors add a proper evaluation section with quantitative results and baselines.

Referee Report

2 major / 0 minor

Summary. The paper proposes using Variational Autoencoders (VAEs) to encode classical music into a latent space and to perform future-value predictions for composition. It is trained on a small dataset of pieces from a single classical composer and asserts that the resulting system yields accurate representations and predictions on unseen data, enabling new music generation either by continuing existing pieces or from random starts.

Significance. If the prediction claims were supported by quantitative evidence, the work would be of interest for demonstrating VAE-based music modeling that operates with limited data; the emphasis on small training sets is a potentially useful angle given data constraints in symbolic music tasks. Without metrics or baselines, however, it is not possible to determine whether the latent representations capture musically meaningful temporal structure or to compare against prior VAE music models.

major comments (2)

[Abstract and results discussion] The central claim that the model produces 'accurate representations and predictions' on unseen data is unsupported by any reported quantitative metrics (test-set loss, MSE, note-level accuracy, reconstruction error) or validation protocol. No baselines (constant predictor, last-note hold, non-variational autoencoder) are mentioned, making it impossible to assess whether the latent-space predictions exploit learned structure or simply generate plausible continuations.
[Approach and results] The weakest assumption—that the VAE latents learned on a small single-composer corpus encode sufficient musical structure for reliable next-step prediction—is load-bearing for the composition application, yet no evidence (e.g., prediction error curves or latent-space interpolation examples with musical analysis) is supplied to test it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and agree that adding quantitative support will improve the manuscript.

Referee: [Abstract and results discussion] The central claim that the model produces 'accurate representations and predictions' on unseen data is unsupported by any reported quantitative metrics (test-set loss, MSE, note-level accuracy, reconstruction error) or validation protocol. No baselines (constant predictor, last-note hold, non-variational autoencoder) are mentioned, making it impossible to assess whether the latent-space predictions exploit learned structure or simply generate plausible continuations.

Authors: We agree that the abstract and results sections assert accuracy without supporting quantitative metrics or baselines. The current manuscript relies on qualitative demonstrations of generated music. We will revise the abstract, add a validation protocol with test-set reconstruction and prediction errors, and include comparisons against simple baselines such as last-note hold and a non-variational autoencoder. revision: yes
Referee: [Approach and results] The weakest assumption—that the VAE latents learned on a small single-composer corpus encode sufficient musical structure for reliable next-step prediction—is load-bearing for the composition application, yet no evidence (e.g., prediction error curves or latent-space interpolation examples with musical analysis) is supplied to test it.

Authors: The manuscript presents the prediction mechanism through composition examples but does not supply quantitative prediction error analysis or detailed latent-space studies. We will add prediction error curves on held-out data and include musical analysis of latent interpolations to better substantiate the assumption. revision: yes

Circularity Check

0 steps flagged

No circularity: standard VAE training and latent-space prediction on external data

full rationale

The manuscript trains a VAE on a small set of classical pieces, encodes music into latent space, and performs next-value prediction from that representation. This follows the ordinary VAE objective (reconstruction + KL) and standard decoder-based forecasting; no equation reduces the reported prediction accuracy to a fitted parameter by definition, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained and externally falsifiable on held-out data.
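
For readers unfamiliar with the "reconstruction + KL" objective mentioned above, the following is a minimal NumPy sketch of the per-example loss for a VAE with a diagonal Gaussian posterior and a standard normal prior. The binary cross-entropy reconstruction term is an assumption suited to binary piano-roll data; the paper's exact loss is not stated here, and `vae_loss` is a hypothetical name.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, eps=1e-7):
    """Negative ELBO for one example.

    x       : target values in [0, 1]
    x_hat   : decoder output probabilities in (0, 1)
    mu      : mean of the approximate posterior q(z|x)
    log_var : log-variance of q(z|x)
    """
    x_hat = np.clip(x_hat, eps, 1 - eps)             # avoid log(0)
    # Reconstruction term: binary cross-entropy summed over output units.
    recon = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    # KL divergence between N(mu, diag(exp(log_var))) and N(0, I), closed form.
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return float(recon + kl), float(recon), float(kl)

x = np.array([1.0, 0.0])
x_hat = np.array([0.5, 0.5])
total, recon, kl = vae_loss(x, x_hat, mu=np.zeros(3), log_var=np.zeros(3))
print(total, recon, kl)
```

For the prediction task, one plausible variant is to compare the decoder output against the following segment of music rather than the input itself. That variant is likewise an assumption rather than a confirmed detail of the paper.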

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the unproven premise that a standard VAE latent space trained on limited classical excerpts will generalize to accurate temporal predictions; no free parameters, new entities, or additional axioms are stated in the abstract.

axioms (1)

domain assumption: A variational autoencoder can learn a latent representation of music that supports future-value prediction.
This is the core modeling choice invoked when the abstract states the system represents music in latent space and makes predictions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "VAEs are used ... to address two different problems: music representation into the latent space, and using this representation to make predictions of the future values of the musical piece."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "the system can represent the music in the latent space, and make accurate predictions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 9 internal anchors

[1] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. arXiv e-prints, arXiv:1609.05473, Sep 2016.
[2] Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. Non-Monotonic Sequential Text Generation. arXiv e-prints, arXiv:1902.02192, Feb 2019.
[3] Hao-Wen Dong and Yi-Hsuan Yang. Convolutional generative adversarial networks with binary neurons for polyphonic music generation. arXiv preprint arXiv:1804.09399, 2018.
[4] A. Ycart and E. Benetos. Polyphonic music sequence transduction with meter-constrained LSTM networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 386–390, April 2018.
[5] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[6] Philippe Esling, Axel Chemla-Romeu-Santos, and Adrien Bitton. Generative timbre spaces with variational audio synthesis. In Proceedings of the 21st International Conference on Digital Audio Effects (DAFx-18), Aveiro, Portugal, September 4–8, 2018.
[7] Yin-Jyun Luo and Li Su. Learning Domain-Adaptive Latent Representations of Music Signals Using Variational Autoencoders. In Proceedings of the 19th International Society for Music Information Retrieval Conference, pages 653–660, Paris, France, September 2018. ISMIR.
[8] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[9] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. arXiv preprint arXiv:1709.06298, 2017.
[10] Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847, 2017.
[11] Felix Gers. Long short-term memory in recurrent neural networks, 05 2001.
[12] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. arXiv e-prints, arXiv:1803.05428, Mar 2018.
[13] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. arXiv e-prints, arXiv:1704.01279, Apr 2017.
[14] Philippe Esling, Axel Chemla-Romeu-Santos, and Adrien Bitton. Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv e-prints, arXiv:1805.08501, May 2018.
[15] Fanny Roche, Thomas Hueber, Samuel Limier, and Laurent Girin. Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv e-prints, arXiv:1806.04096, Jun 2018.
[16] Merlijn Blaauw and J. Bonada. Modeling and transforming speech using variational autoencoders. In Interspeech, San Francisco, USA, 2016.
[17] Wei-Ning Hsu, Yu Zhang, and James Glass. Learning Latent Representations for Speech Generation and Transformation. arXiv e-prints, arXiv:1704.04222, Apr 2017.
[18] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv e-prints, arXiv:1312.6114, Dec 2013.
[19] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. 02 2016.
[20] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006. ROC Analysis in Pattern Recognition.
[21] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2nd edition, 2009.
