arXiv: 1906.09972 · v1 · pith:3YTGDZZGnew · submitted 2019-06-21 · 💻 cs.SD · cs.LG · cs.NE · eess.AS

Classical Music Prediction and Composition by means of Variational Autoencoders

Pith reviewed 2026-05-25 18:12 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · cs.NE · eess.AS
keywords variational autoencoders · music composition · music prediction · latent space · classical music · generative models · sequence modeling

The pith

A variational autoencoder can represent classical music in latent space for accurate predictions and composition of new pieces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that variational autoencoders trained on a small set of classical pieces learn to map music into a latent space. From this space the model can both reconstruct the original and forecast the next musical values. These forecasts work well enough on pieces not used in training that the model can generate new music by extending an input fragment or by beginning from a random latent point. A reader would care because this approach avoids the need for large datasets or explicit musical rules. If the claim holds, modest collections of music become sufficient to build predictive composition systems.

Core claim

The central claim is that variational autoencoders provide a way to embed musical sequences into a latent space that supports both representation and prediction of future values, allowing the generation of new classical music either by continuing an existing piece or from a random starting point, and that this works even with small training sets.

What carries the argument

Variational autoencoder trained to reconstruct and predict music sequences in latent space.

If this is right

  • Music composition is possible by predicting forward from an encoded existing piece.
  • New pieces can be started from a random point in the latent space.
  • The system generalizes to unseen classical pieces despite small training data.
  • Accurate latent representations enable both prediction and generation tasks.
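
The two composition modes listed above can be sketched as a single feed-forward loop. This is a minimal illustration, not the paper's implementation: `continue_piece`, `seed_from_random_latent`, the binary piano-roll `Frame` type, and both `toy_` functions are hypothetical names, and the toy functions only stand in for a trained predictor and decoder.

```python
import random
from typing import Callable, List

Frame = List[int]  # assumption: one time step of a binary piano roll (1 = note on)


def continue_piece(
    seed: List[Frame],
    predict_next: Callable[[List[Frame]], Frame],
    context: int,
    steps: int,
) -> List[Frame]:
    """Extend `seed` by feeding the last `context` frames back into the predictor."""
    if len(seed) < context:
        raise ValueError("seed must contain at least `context` frames")
    piece = [list(frame) for frame in seed]
    for _ in range(steps):
        piece.append(list(predict_next(piece[-context:])))
    return piece


def seed_from_random_latent(
    decode: Callable[[List[float]], List[Frame]],
    latent_dim: int,
    rng: random.Random,
) -> List[Frame]:
    """Draw z from a standard normal prior and decode it into a starting fragment."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return decode(z)


# Stand-ins for a trained model (hypothetical, for illustration only).
def toy_predict_next(history: List[Frame]) -> Frame:
    last = history[-1]
    return last[-1:] + last[:-1]  # rotate the last frame by one pitch position


def toy_decode(z: List[float]) -> List[Frame]:
    frame = [1 if value > 0 else 0 for value in z]
    return [frame, frame]  # a two-frame starting fragment


# Mode 1: continue an existing fragment.
extended = continue_piece([[1, 0, 0], [0, 1, 0]], toy_predict_next, context=2, steps=2)

# Mode 2: start from a random latent point, then continue in the same way.
from_scratch = continue_piece(
    seed_from_random_latent(toy_decode, latent_dim=3, rng=random.Random(0)),
    toy_predict_next,
    context=2,
    steps=4,
)
```

With a real model, `predict_next` would encode the context window and decode the predicted continuation; the loop itself would stay the same.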

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar models could be tested on other musical genres or non-Western traditions.
  • The latent space might allow smooth transitions between different styles of music.
  • Extending the prediction horizon could reveal limits of the learned structure.

Load-bearing premise

The latent representations learned by the VAE on a small set of classical pieces capture enough musical structure to support accurate future-value predictions on unseen data.

What would settle it

Running the trained model on a new classical piece withheld from training and checking whether its prediction error is substantially lower than that of a naive baseline such as repeating the previous note.
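
A minimal sketch of how such a check could be set up, assuming the music is stored as a binary piano roll (a list of frames, one 0/1 entry per pitch). The function names and the toy data are hypothetical; a trained model would be passed in as `predict_next` in place of the baseline.

```python
from typing import Callable, List, Sequence

Frame = Sequence[int]  # assumption: one time step of a binary piano roll


def frame_error(predicted: Frame, actual: Frame) -> float:
    """Fraction of pitches whose on/off state is predicted wrongly."""
    if len(predicted) != len(actual):
        raise ValueError("frames must have the same number of pitches")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def mean_next_frame_error(
    roll: List[Frame],
    predict_next: Callable[[List[Frame]], Frame],
    context: int,
) -> float:
    """Average error when predicting frame t from the `context` frames before it."""
    errors = [
        frame_error(predict_next(roll[t - context:t]), roll[t])
        for t in range(context, len(roll))
    ]
    if not errors:
        raise ValueError("piece is shorter than the context window")
    return sum(errors) / len(errors)


def hold_last_frame(history: List[Frame]) -> Frame:
    """Naive baseline: repeat the most recent frame unchanged."""
    return history[-1]


# Toy held-out piece: 4 pitches, 6 time steps (hypothetical data).
held_out = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]
baseline_error = mean_next_frame_error(held_out, hold_last_frame, context=2)
print(f"hold-last baseline error: {baseline_error:.3f}")
```

Calling `mean_next_frame_error` with the model's predictor on the same held-out piece gives the number to compare against `baseline_error`.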

Figures

Figures reproduced from arXiv: 1906.09972 by Alejandro Pazos, Daniel Rivero, Enrique Fernandez-Blanco.

Figure 1: F1 score and accuracy for each configuration.
Figure 2: Sensitivity, VPP and F1-score for the different thresholds in the selected configuration. Adjacent text from the paper: "As already stated, the system tries to predict 1 second of music from T − 1 seconds of music. Therefore, the value of T is an important parameter. Low values do not give enough information to predict the music. On the other hand, values that are too large may give too much information and make the training process too s…"
Figure 3: Comparison of the results in the reconstruction and prediction problems.
Figure 4: Results in prediction on different moments.
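
The threshold sweep named in the Figure 2 caption can be illustrated with a short sketch. It rests on two assumptions that the page does not confirm: that the model emits a score per note which is binarised at a threshold, and that "VPP" denotes positive predictive value (precision). The function name and the sample scores are hypothetical.

```python
from typing import Dict, Sequence


def threshold_metrics(
    probabilities: Sequence[float],
    targets: Sequence[int],
    threshold: float,
) -> Dict[str, float]:
    """Binarise scores at `threshold` and compare them with 0/1 targets."""
    if len(probabilities) != len(targets):
        raise ValueError("probabilities and targets must have the same length")
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predicted, targets))
    tp = sum(1 for y_hat, y in pairs if y_hat == 1 and y == 1)
    fp = sum(1 for y_hat, y in pairs if y_hat == 1 and y == 0)
    fn = sum(1 for y_hat, y in pairs if y_hat == 0 and y == 1)
    # Each ratio falls back to 0.0 when its denominator is empty.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity) if ppv + sensitivity else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


# Hypothetical note scores and ground-truth activations.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for threshold in (0.25, 0.5, 0.75):
    print(threshold, threshold_metrics(scores, labels, threshold))
```

Raising the threshold typically trades sensitivity for positive predictive value, which is why F1 is a common criterion for choosing the operating point.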
Original abstract

This paper proposes a new model for music prediction based on Variational Autoencoders (VAEs). In this work, VAEs are used in a novel way in order to address two different problems: music representation into the latent space, and using this representation to make predictions of the future values of the musical piece. This approach was trained with different songs of a classical composer. As a result, the system can represent the music in the latent space, and make accurate predictions. Therefore, the system can be used to compose new music either from an existing piece or from a random starting point. An additional feature of this system is that a small dataset was used for training. However, results show that the system is able to return accurate representations and predictions in unseen data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes using Variational Autoencoders (VAEs) to encode classical music into a latent space and to perform future-value predictions for composition. It is trained on a small dataset of pieces from a single classical composer and asserts that the resulting system yields accurate representations and predictions on unseen data, enabling new music generation either by continuing existing pieces or from random starts.

Significance. If the prediction claims were supported by quantitative evidence, the work would be of interest for demonstrating VAE-based music modeling that operates with limited data; the emphasis on small training sets is a potentially useful angle given data constraints in symbolic music tasks. Without metrics or baselines, however, it is not possible to determine whether the latent representations capture musically meaningful temporal structure or to compare against prior VAE music models.

major comments (2)
  1. [Abstract and results discussion] The central claim that the model produces 'accurate representations and predictions' on unseen data is not supported by any reported quantitative metric (test-set loss, MSE, note-level accuracy, reconstruction error) or validation protocol. No baselines (constant predictor, last-note hold, non-variational autoencoder) are mentioned, so it is impossible to assess whether the latent-space predictions exploit learned structure or simply generate plausible continuations.
  2. [Approach and results] The weakest assumption—that the VAE latents learned on a small single-composer corpus encode sufficient musical structure for reliable next-step prediction—is load-bearing for the composition application, yet no evidence (e.g., prediction error curves or latent-space interpolation examples with musical analysis) is supplied to test it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and agree that adding quantitative support will improve the manuscript.

  1. Referee: [Abstract and results discussion] The central claim that the model produces 'accurate representations and predictions' on unseen data is not supported by any reported quantitative metric (test-set loss, MSE, note-level accuracy, reconstruction error) or validation protocol. No baselines (constant predictor, last-note hold, non-variational autoencoder) are mentioned, so it is impossible to assess whether the latent-space predictions exploit learned structure or simply generate plausible continuations.

    Authors: We agree that the abstract and results sections assert accuracy without supporting quantitative metrics or baselines. The current manuscript relies on qualitative demonstrations of generated music. We will revise the abstract, add a validation protocol with test-set reconstruction and prediction errors, and include comparisons against simple baselines such as last-note hold and a non-variational autoencoder.
    Revision: yes

  2. Referee: [Approach and results] The weakest assumption—that the VAE latents learned on a small single-composer corpus encode sufficient musical structure for reliable next-step prediction—is load-bearing for the composition application, yet no evidence (e.g., prediction error curves or latent-space interpolation examples with musical analysis) is supplied to test it.

    Authors: The manuscript presents the prediction mechanism through composition examples but does not supply quantitative prediction error analysis or detailed latent-space studies. We will add prediction error curves on held-out data and include musical analysis of latent interpolations to better substantiate the assumption.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: standard VAE training and latent-space prediction on external data

The manuscript trains a VAE on a small set of classical pieces, encodes music into latent space, and performs next-value prediction from that representation. This follows the ordinary VAE objective (reconstruction + KL) and standard decoder-based forecasting; no equation reduces the reported prediction accuracy to a fitted parameter by definition, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained and externally falsifiable on held-out data.
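
For reference, the "reconstruction + KL" objective mentioned above is usually written as the negative evidence lower bound. The notation below follows the standard VAE formulation and is an assumption: this page does not state the paper's exact loss, and in the prediction setting the decoder target may be the next segment rather than the input itself.

```latex
% Per-example negative ELBO for encoder q_\phi(z \mid x) and decoder p_\theta(x \mid z)
\[
\mathcal{L}(\theta, \phi; x)
  = \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  + \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{KL regulariser}}
\]

% Closed form of the KL term when q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))
% and p(z) = \mathcal{N}(0, I), with z of dimension d
\[
D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)
\]
```

Neither term is defined in terms of held-out prediction accuracy, which is consistent with the audit's finding that the reported accuracy is not circular by construction.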

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the unproven premise that a standard VAE latent space trained on limited classical excerpts will generalize to accurate temporal predictions; no free parameters, new entities, or additional axioms are stated in the abstract.

axioms (1)
  • Domain assumption: a variational autoencoder can learn a latent representation of music that supports future-value prediction.
    This is the core modeling choice invoked when the abstract states the system represents music in latent space and makes predictions.
