Classical Music Prediction and Composition by means of Variational Autoencoders
Pith reviewed 2026-05-25 18:12 UTC · model grok-4.3
The pith
A variational autoencoder can represent classical music in latent space for accurate predictions and composition of new pieces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that variational autoencoders can embed musical sequences in a latent space that supports both representation and prediction of future values. This would allow new classical music to be generated either by continuing an existing piece or from a random starting point, and the paper claims it works even with small training sets.
What carries the argument
A variational autoencoder trained to reconstruct music sequences and to predict their future values from the latent space.
If this is right
- Music composition is possible by predicting forward from an encoded existing piece.
- New pieces can be started from a random point in the latent space.
- The system generalizes to unseen classical pieces despite small training data.
- Accurate latent representations enable both prediction and generation tasks.
Where Pith is reading between the lines
- Similar models could be tested on other musical genres or non-Western traditions.
- The latent space might allow smooth transitions between different styles of music.
- Extending the prediction horizon could reveal limits of the learned structure.
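The style-transition idea above can be made concrete with a minimal sketch of linear interpolation between two latent codes. The paper's encoder and decoder are not described on this page, so the two-dimensional vectors below are hypothetical stand-ins, and `interpolate_latents` is an illustrative helper rather than anything from the paper.

```python
from typing import List, Sequence


def interpolate_latents(
    z_start: Sequence[float], z_end: Sequence[float], steps: int
) -> List[List[float]]:
    """Return `steps` latent vectors evenly spaced from z_start to z_end, inclusive."""
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at z_start, 1.0 at z_end
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical 2-D latent codes for two pieces (assumption: real codes
# would come from the trained encoder, which is not available here).
path = interpolate_latents([0.0, 2.0], [4.0, 0.0], steps=5)
print(path[2])  # midpoint between the two codes
```

Each intermediate vector would then be decoded to check whether the output changes gradually. Whether it does is an empirical question that this sketch cannot answer.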
Load-bearing premise
The latent representations learned by the VAE on a small set of classical pieces capture enough musical structure to support accurate future-value predictions on unseen data.
What would settle it
Running the trained model on a new classical piece withheld from training and checking whether its prediction error is substantially lower than that of a naive baseline such as repeating the previous note.
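A minimal sketch of that check follows, assuming pieces are encoded as sequences of numeric pitch values and that error is measured as mean squared error. The paper's actual encoding and metric are not stated on this page, and the pitches and model forecasts below are invented purely for illustration.

```python
from typing import List, Sequence, Tuple


def mse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def last_note_hold(sequence: Sequence[float]) -> List[float]:
    """Naive next-step forecast: each value is predicted as a copy of the previous one."""
    return list(sequence[:-1])


def compare_to_baseline(
    held_out: Sequence[float], model_predictions: Sequence[float]
) -> Tuple[float, float]:
    """Return (model_mse, baseline_mse) for next-step prediction on one held-out piece.

    model_predictions[i] is the model's forecast for held_out[i + 1].
    """
    targets = list(held_out[1:])
    return mse(model_predictions, targets), mse(last_note_hold(held_out), targets)


# Toy MIDI pitches standing in for a withheld piece; the forecasts are made up.
held_out = [60, 62, 64, 65, 67]
model_predictions = [62, 63, 65, 67]
model_err, baseline_err = compare_to_baseline(held_out, model_predictions)
print(model_err, baseline_err)
```

The claim would be supported only if the model's error is well below the baseline's across several withheld pieces, not just one.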
Original abstract
This paper proposes a new model for music prediction based on Variational Autoencoders (VAEs). In this work, VAEs are used in a novel way in order to address two different problems: music representation into the latent space, and using this representation to make predictions of the future values of the musical piece. This approach was trained with different songs of a classical composer. As a result, the system can represent the music in the latent space, and make accurate predictions. Therefore, the system can be used to compose new music either from an existing piece or from a random starting point. An additional feature of this system is that a small dataset was used for training. However, results show that the system is able to return accurate representations and predictions in unseen data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using Variational Autoencoders (VAEs) to encode classical music into a latent space and to perform future-value predictions for composition. It is trained on a small dataset of pieces from a single classical composer and asserts that the resulting system yields accurate representations and predictions on unseen data, enabling new music generation either by continuing existing pieces or from random starts.
Significance. If the prediction claims were supported by quantitative evidence, the work would be of interest for demonstrating VAE-based music modeling that operates with limited data; the emphasis on small training sets is a potentially useful angle given data constraints in symbolic music tasks. Without metrics or baselines, however, it is not possible to determine whether the latent representations capture musically meaningful temporal structure or to compare against prior VAE music models.
major comments (2)
- [Abstract and results discussion] The central claim that the model produces 'accurate representations and predictions' on unseen data is unsupported by any reported quantitative metrics (test-set loss, MSE, note-level accuracy, reconstruction error) or validation protocol. No baselines (constant predictor, last-note hold, non-variational autoencoder) are mentioned, making it impossible to assess whether the latent-space predictions exploit learned structure or simply generate plausible continuations.
- [Approach and results] The weakest assumption, that the VAE latents learned on a small single-composer corpus encode sufficient musical structure for reliable next-step prediction, is load-bearing for the composition application, yet no evidence (e.g., prediction error curves or latent-space interpolation examples with musical analysis) is supplied to test it.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and agree that adding quantitative support will improve the manuscript.
- Referee: [Abstract and results discussion] The central claim that the model produces 'accurate representations and predictions' on unseen data is unsupported by any reported quantitative metrics (test-set loss, MSE, note-level accuracy, reconstruction error) or validation protocol. No baselines (constant predictor, last-note hold, non-variational autoencoder) are mentioned, making it impossible to assess whether the latent-space predictions exploit learned structure or simply generate plausible continuations.
  Authors: We agree that the abstract and results sections assert accuracy without supporting quantitative metrics or baselines. The current manuscript relies on qualitative demonstrations of generated music. We will revise the abstract, add a validation protocol with test-set reconstruction and prediction errors, and include comparisons against simple baselines such as last-note hold and a non-variational autoencoder.
  Revision: yes
- Referee: [Approach and results] The weakest assumption, that the VAE latents learned on a small single-composer corpus encode sufficient musical structure for reliable next-step prediction, is load-bearing for the composition application, yet no evidence (e.g., prediction error curves or latent-space interpolation examples with musical analysis) is supplied to test it.
  Authors: The manuscript presents the prediction mechanism through composition examples but does not supply quantitative prediction error analysis or detailed latent-space studies. We will add prediction error curves on held-out data and include musical analysis of latent interpolations to better substantiate the assumption.
  Revision: yes
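One way the promised prediction error curves could be computed is sketched below, under the assumption that error is the mean squared difference at each forecast horizon. `error_curve` and `hold_last` are hypothetical names. The hold-last forecaster is only the naive baseline, and a trained model's forecaster would be passed in its place.

```python
from typing import Callable, List, Sequence

# A forecaster maps (observed context, horizon) to one predicted value.
Forecaster = Callable[[Sequence[float], int], float]


def hold_last(context: Sequence[float], horizon: int) -> float:
    """Baseline forecaster: whatever the horizon, repeat the last observed value."""
    return context[-1]


def error_curve(
    sequence: Sequence[float], forecast: Forecaster, max_horizon: int
) -> List[float]:
    """Mean squared error at horizons 1..max_horizon, averaged over all valid start points."""
    curve = []
    for h in range(1, max_horizon + 1):
        errors = [
            (forecast(sequence[: t + 1], h) - sequence[t + h]) ** 2
            for t in range(len(sequence) - h)
        ]
        if not errors:
            raise ValueError("sequence too short for horizon %d" % h)
        curve.append(sum(errors) / len(errors))
    return curve


# Toy rising line of MIDI pitches (illustrative data, not from the paper).
print(error_curve([60, 61, 62, 63, 64], hold_last, max_horizon=3))
```

Plotting the model's curve against the baseline's on held-out pieces would show at which horizon, if any, the learned structure stops helping.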
Circularity Check
No circularity: standard VAE training and latent-space prediction on external data
The manuscript trains a VAE on a small set of classical pieces, encodes music into latent space, and performs next-value prediction from that representation. This follows the ordinary VAE objective (reconstruction + KL) and standard decoder-based forecasting; no equation reduces the reported prediction accuracy to a fitted parameter by definition, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained and externally falsifiable on held-out data.
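For reference, the ordinary VAE objective mentioned above can be sketched as follows. This assumes a diagonal Gaussian encoder, a standard normal prior, and a squared-error reconstruction term. The page does not state the paper's exact loss or weighting, so the function names and the choice of reconstruction term are illustrative assumptions.

```python
import math
from typing import Sequence


def gaussian_kl(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def vae_loss(
    x: Sequence[float],
    x_hat: Sequence[float],
    mu: Sequence[float],
    log_var: Sequence[float],
) -> float:
    """Squared reconstruction error plus the KL term (negative ELBO up to constants).

    Assumption: squared error stands in for whatever reconstruction
    likelihood the paper actually uses.
    """
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + gaussian_kl(mu, log_var)


# The KL term vanishes when the encoder output equals the prior.
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))
```

Neither term depends on the held-out prediction error, which is the sense in which the reported accuracy is not fitted by construction.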
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: a variational autoencoder can learn a latent representation of music that supports future-value prediction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "VAEs are used ... to address two different problems: music representation into the latent space, and using this representation to make predictions of the future values of the musical piece."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "the system can represent the music in the latent space, and make accurate predictions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.