MIDI-Sandwich: Multi-model Multi-task Hierarchical Conditional VAE-GAN networks for Symbolic Single-track Music Generation

Junmin Wu; Xia Liang; Yan Yin

arXiv: 1907.01607 · v2 · pith:J4WG2C75new · submitted 2019-07-02 · eess.AS · cs.LG · cs.MM · cs.SD


Pith reviewed 2026-05-25 10:08 UTC · model grok-4.3

classification eess.AS · cs.LG · cs.MM · cs.SD

keywords music generation · VAE-GAN · hierarchical conditional model · symbolic music · single-track melody · MIDI · musical form · Nottingham dataset

The pith

A hierarchical conditional VAE-GAN generates single-track music sequences of 136 beats that include musical form and direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most existing models create separate music bars and splice them together, producing songs that lack overall structure or sense of direction. The MIDI-Sandwich model uses a two-layer hierarchy to fix this. Its lower L-CVAE layer generates individual bars conditioned on specified first and last notes. The upper G-VAE layer then processes the sequence of latent vectors from those bars to enforce relationships that give the full piece form and direction. This VAE structure shares components with a paired HCGAN, and the combined system produces longer single-track melodies than typical models while incorporating elements like tonic and melodic motion.
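The two-layer data flow described above can be sketched at the level of shapes. This is a minimal illustration under stated assumptions, not the authors' implementation: random linear maps stand in for trained networks, the sampling step of a VAE and the first/last-note condition are omitted, and the names `LOCAL_DIM`, `GLOBAL_DIM`, `matvec`, and `round_trip` are hypothetical. Only the 17 bars of 8 steps come from the paper's stated output length.

```python
import random

BARS = 17        # bars per song, read from the "17x8 beats" figure
STEPS = 8        # time steps per bar (assumed here: one step per beat)
LOCAL_DIM = 4    # hypothetical size of each bar-level latent vector
GLOBAL_DIM = 6   # hypothetical size of the song-level latent vector


def linear(rows, cols, rng):
    """Random weight matrix standing in for a trained network layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


rng = random.Random(0)
local_enc = linear(LOCAL_DIM, STEPS, rng)               # bar -> bar latent
global_enc = linear(GLOBAL_DIM, BARS * LOCAL_DIM, rng)  # latent sequence -> song code
global_dec = linear(BARS * LOCAL_DIM, GLOBAL_DIM, rng)  # song code -> latent sequence
local_dec = linear(STEPS, LOCAL_DIM, rng)               # bar latent -> bar


def round_trip(song):
    """Encode bars locally, summarize them globally, then decode back to bars."""
    # Lower layer: one latent vector per bar.
    bar_latents = [matvec(local_enc, bar) for bar in song]
    # Upper layer: the whole latent sequence is mapped to a single song-level code.
    flat = [value for latent in bar_latents for value in latent]
    song_code = matvec(global_enc, flat)
    # The song-level code is expanded back into one latent vector per bar.
    flat_out = matvec(global_dec, song_code)
    new_latents = [flat_out[i * LOCAL_DIM:(i + 1) * LOCAL_DIM] for i in range(BARS)]
    # Lower layer again: each latent vector is decoded into a bar.
    return [matvec(local_dec, latent) for latent in new_latents]


song = [[float(rng.randint(60, 72)) for _ in range(STEPS)] for _ in range(BARS)]
out = round_trip(song)
print(len(out), len(out[0]), len(out) * len(out[0]))  # 17 8 136
```

The point of the sketch is that every decoded bar depends on a code computed from all 17 bars, which is what distinguishes this design from generating bars independently and splicing them.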

Core claim

MIDI-Sandwich combines an HCVAE and an HCGAN in a multi-model, multi-task setup. The HCVAE's lower layer, the L-CVAE, generates bars pre-specified by their first and last notes. Its upper layer, the G-VAE, analyzes the resulting latent vector sequence to explore musical relationships between bars and assembles them into songs that have both structure and direction. Sharing part of the HCVAE with the HCGAN further improves output quality, enabling single-track melody sequences of 17x8 (136) beats on the Nottingham dataset.

What carries the argument

MIDI-Sandwich hierarchical conditional VAE-GAN, where the lower L-CVAE generates bars conditioned on first and last notes and the upper G-VAE models relationships across the latent vector sequence from multiple bars.
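One common way to condition a decoder on side information is to concatenate a condition vector to the latent vector. The sketch below applies that idea to the first and last notes of a bar. It is an assumption for illustration only: this page does not specify how the paper encodes the condition, and the pitch range, the one-hot encoding, and the names `one_hot`, `fln_condition`, and `decoder_input` are hypothetical.

```python
PITCH_MIN, PITCH_MAX = 48, 83              # hypothetical pitch range (MIDI note numbers)
NUM_PITCHES = PITCH_MAX - PITCH_MIN + 1    # 36 pitches in this assumed range


def one_hot(pitch):
    """One-hot vector for a pitch inside the assumed range."""
    if not PITCH_MIN <= pitch <= PITCH_MAX:
        raise ValueError(f"pitch {pitch} is outside the assumed range")
    vec = [0.0] * NUM_PITCHES
    vec[pitch - PITCH_MIN] = 1.0
    return vec


def fln_condition(bar):
    """Condition vector built from the first and last notes of a bar."""
    return one_hot(bar[0]) + one_hot(bar[-1])


def decoder_input(latent, bar):
    """Concatenate a bar-level latent vector with its first/last-note condition."""
    return list(latent) + fln_condition(bar)


bar = [60, 62, 64, 65, 67, 65, 64, 62]
x = decoder_input([0.1, -0.3, 0.7], bar)
print(len(x))  # 3 latent values + 2 * 36 condition values = 75
```

With this encoding, adjacent bars can be made to connect smoothly by choosing the last note of one bar and the first note of the next before any bar content is generated.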

If this is right

Generated music reaches 136 beats, exceeding the typical length range of 8 to 32 beats in prior models.
Songs exhibit explicit musical form, tonic, and melodic motion through the global analysis of bar relationships.
Component sharing between the hierarchical VAE and GAN improves generation performance beyond the VAE alone.
The model is shown effective through standard evaluation protocols on the Nottingham dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stacking additional hierarchy levels could enable coherent generation at even greater lengths by extending the same bar-relationship mechanism.
Conditioning generation on first and last notes of each bar offers a general way to inject local structure constraints into sequential creative tasks.
The separation of local bar generation from global relationship modeling may apply to other domains that need both detail and long-range coherence.

Load-bearing premise

The global VAE layer can analyze the sequence of latent vectors to capture musical relationships between bars and thereby produce songs that have structure and direction.

What would settle it

If side-by-side listening tests or structural metrics on the Nottingham dataset show no measurable gain in musical direction or form for 136-beat outputs versus simple bar-splicing methods, the benefit of the hierarchical G-VAE layer would be falsified.
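As one example of a structural metric that such a comparison could use, the sketch below computes a bar-level repetition rate: the fraction of bars whose interval pattern already appeared earlier in the song. The definition and the names `interval_pattern` and `repetition_rate` are hypothetical choices for illustration; the paper does not report this metric.

```python
def interval_pattern(bar):
    """Pitch intervals between consecutive notes, so transposed repeats still match."""
    return tuple(b - a for a, b in zip(bar, bar[1:]))


def repetition_rate(song):
    """Fraction of bars whose interval pattern occurred in an earlier bar."""
    if not song:
        return 0.0
    seen = set()
    repeats = 0
    for bar in song:
        pattern = interval_pattern(bar)
        if pattern in seen:
            repeats += 1
        seen.add(pattern)
    return repeats / len(song)


song = [
    [60, 62, 64, 62],  # motif
    [67, 69, 71, 69],  # same motif transposed up a fifth
    [60, 60, 60, 60],  # contrasting bar
    [60, 62, 64, 62],  # motif returns
]
print(repetition_rate(song))  # 0.5
```

Computing the same score for hierarchical outputs and for spliced bars, and comparing both with the score of the training songs, would give one concrete number for the comparison described above.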

Figures

Figures reproduced from arXiv: 1907.01607 by Junmin Wu, Xia Liang, Yan Yin.

**Figure 3.** Task1 model: L-CVAE

**Figure 5.** Three music fragments within a song that have relationships

**Figure 7.** Long-term music clips generated by MIDI-Sandwich

Original abstract

Most existing neural network models for music generation explore how to generate music bars, then directly splice the music bars into a song. However, these methods do not explore the relationship between the bars, and the connected song as a whole has no musical form structure and sense of musical direction. To address this issue, we propose a Multi-model Multi-task Hierarchical Conditional VAE-GAN (Variational Autoencoder-Generative adversarial networks) networks, named MIDI-Sandwich, which combines musical knowledge, such as musical form, tonic, and melodic motion. The MIDI-Sandwich has two submodels: Hierarchical Conditional Variational Autoencoder (HCVAE) and Hierarchical Conditional Generative Adversarial Network (HCGAN). The HCVAE uses hierarchical structure. The underlying layer of HCVAE uses Local Conditional Variational Autoencoder (L-CVAE) to generate a music bar which is pre-specified by the First and Last Notes (FLN). The upper layer of HCVAE uses Global Variational Autoencoder(G-VAE) to analyze the latent vector sequence generated by the L-CVAE encoder, to explore the musical relationship between the bars, and to produce the song pieced together by multiple music bars generated by the L-CVAE decoder, which makes the song both have musical structure and sense of direction. At the same time, the HCVAE shares a part of itself with the HCGAN to further improve the performance of the generated music. The MIDI-Sandwich is validated on the Nottingham dataset and is able to generate a single-track melody sequence (17x8 beats), which is superior to the length of most of the generated models (8 to 32 beats). Meanwhile, by referring to the experimental methods of many classical kinds of literature, the quality evaluation of the generated music is performed. The above experiments prove the validity of the model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches a hierarchical VAE-GAN for longer single-track MIDI with bar-level conditioning, but supplies no metrics or ablations to show the global layer actually adds structure.

The core idea is to stop just concatenating bars and instead use an upper G-VAE layer on the latent sequence from the lower L-CVAE to capture relationships between bars, plus some sharing with a conditional GAN. That multi-model setup and the first-last-note conditioning on bars are the concrete extensions they propose. The architecture description is clear enough on paper and shows they thought about musical form and direction rather than treating generation as pure sequence modeling.

Referee Report

3 major / 2 minor

Summary. The paper proposes MIDI-Sandwich, a multi-model multi-task hierarchical conditional VAE-GAN architecture for single-track symbolic music generation. It consists of HCVAE (L-CVAE for generating bars conditioned on first/last notes, G-VAE for analyzing latent sequences to capture inter-bar relationships and produce structured output with musical form/direction) sharing components with HCGAN; the model is trained and evaluated on the Nottingham dataset and claims to generate longer sequences (17x8 beats) with incorporated musical knowledge, with validity 'proved' by reference to classical evaluation methods.

Significance. A working hierarchical mechanism that demonstrably improves long-range musical structure over flat bar-splicing baselines would be a useful contribution to conditional music generation. The multi-task VAE-GAN sharing is a reasonable design choice. However, the absence of any quantitative results means the significance cannot yet be assessed from the manuscript.

major comments (3)

[Abstract] The statement that 'the above experiments prove the validity of the model' is unsupported; no quantitative metrics, baselines, error bars, or ablation results are reported anywhere in the manuscript for musical quality, structure, or direction.
[Abstract / §4 (results)] The central claim that G-VAE 'analyzes the latent vector sequence ... to explore the musical relationship between the bars' and produces songs with form, tonic, and direction lacks any supporting metric (e.g., repetition rate, pitch-contour consistency, form adherence) or comparison to an L-CVAE-only ablation.
[Abstract] The assertion of superiority in length ('17x8 beats' vs. '8 to 32 beats') and incorporation of musical knowledge is presented without any objective or subjective evaluation scores, tables, or controls against prior models.
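To make the kind of metric requested above concrete, here is a minimal sketch of one possible pitch-contour consistency score: the fraction of adjacent bar pairs whose net melodic direction (rising, falling, or level) agrees. The definition and the names `bar_direction` and `contour_consistency` are hypothetical and chosen for illustration; neither the referee report nor the paper defines the metric.

```python
def sign(x):
    """Return 1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def bar_direction(bar):
    """Net melodic direction of a bar: sign of (last pitch - first pitch)."""
    return sign(bar[-1] - bar[0])


def contour_consistency(song):
    """Fraction of adjacent bar pairs that share the same net direction."""
    directions = [bar_direction(bar) for bar in song if bar]
    pairs = list(zip(directions, directions[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


song = [
    [60, 62, 64],  # rising
    [62, 65, 67],  # rising
    [67, 64, 60],  # falling
    [65, 64, 62],  # falling
]
print(contour_consistency(song))  # 2 of 3 adjacent pairs agree
```

A score like this only becomes evidence when reported for the full model, for an L-CVAE-only ablation, and for the training data side by side.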

minor comments (2)

[Abstract] The notation '17x8 beats' is ambiguous and should be clarified (bars × beats per bar or total length).
[Throughout] Model-component acronyms (L-CVAE, G-VAE, FLN, HCVAE, HCGAN) should be defined on first use and used consistently.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify that the manuscript's claims regarding model validity, musical structure, and superiority require quantitative substantiation, which is currently absent. We will revise the paper to include the requested metrics, ablations, and comparisons.

Referee: [Abstract] The statement that 'the above experiments prove the validity of the model' is unsupported; no quantitative metrics, baselines, error bars, or ablation results are reported anywhere in the manuscript for musical quality, structure, or direction.

Authors: We agree that this statement is unsupported in the current manuscript. The revised version will remove or qualify the claim and incorporate quantitative metrics (e.g., repetition rate, pitch-contour consistency) with baselines and error bars drawn from classical music generation evaluation methods.

revision: yes

Referee: [Abstract / §4 (results)] The central claim that G-VAE 'analyzes the latent vector sequence ... to explore the musical relationship between the bars' and produces songs with form, tonic, and direction lacks any supporting metric (e.g., repetition rate, pitch-contour consistency, form adherence) or comparison to an L-CVAE-only ablation.

Authors: The manuscript currently presents this as a qualitative outcome without explicit metrics or ablations. In revision we will add supporting quantitative measures for form, tonic adherence, and direction, together with a direct L-CVAE-only ablation study.

revision: yes

Referee: [Abstract] The assertion of superiority in length ('17x8 beats' vs. '8 to 32 beats') and incorporation of musical knowledge is presented without any objective or subjective evaluation scores, tables, or controls against prior models.

Authors: We acknowledge that superiority claims require supporting evidence. The revision will include objective and subjective evaluation scores, comparison tables against prior models, and controls for length and musical-knowledge incorporation.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture with external dataset validation

The paper proposes a neural network architecture (HCVAE with L-CVAE/G-VAE layers plus HCGAN sharing) for symbolic music generation and evaluates it empirically on the Nottingham dataset. No mathematical derivation chain, equations, or first-principles results are presented that could reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The claim that the G-VAE layer captures inter-bar relationships is an architectural assertion supported by the model description and experimental outcomes rather than any circular reduction. This is a standard empirical ML paper whose validity rests on external data and evaluation, not internal self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The paper relies on standard deep learning assumptions for generative models and introduces many tunable parameters typical of neural architectures; no new physical entities are postulated.

free parameter (1)

network architectures, layer sizes, and training hyperparameters
Neural network models require extensive tuning of parameters during training on the dataset.

axioms (2)

Domain assumption: A hierarchical structure in the VAE can separately model local bar generation and global song-level relationships.
Invoked in the design of the L-CVAE and G-VAE layers.
Domain assumption: Sharing components between the HCVAE and the HCGAN improves generation quality.
Stated as part of the multi-model multi-task design.

