arxiv: 2005.00341 · v1 · pith:YGSCZ6HPnew · submitted 2020-04-30 · 📡 eess.AS · cs.LG · cs.SD · stat.ML

Jukebox: A Generative Model for Music

Pith reviewed 2026-05-24 15:16 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD · stat.ML
keywords generative music model · raw audio generation · VQ-VAE · autoregressive transformer · conditioned music generation · singing voice synthesis · multi-scale compression

The pith

Jukebox generates high-fidelity songs with vocals in raw audio by compressing waveforms into discrete codes and modeling them with autoregressive transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generative model that produces music containing singing directly as raw audio waveforms rather than symbolic or MIDI representations. It solves the problem of very long audio sequences by first using a multi-scale vector-quantized variational autoencoder to turn the waveform into a hierarchy of discrete codes, then training large autoregressive transformer models on those codes. The resulting system produces diverse, high-quality songs that remain coherent for multiple minutes and can be steered by conditioning on artist identity, genre labels, and unaligned lyrics. A reader would care because the work shows a concrete path from raw audio data to controllable, minute-scale musical output without intermediate symbolic steps.

Core claim

Jukebox generates music with singing in the raw audio domain by compressing audio with a multi-scale VQ-VAE into discrete codes and modeling those codes with autoregressive Transformers, which at scale produces high-fidelity and diverse songs coherent up to multiple minutes while allowing conditioning on artist, genre, and unaligned lyrics.

What carries the argument

Multi-scale VQ-VAE that compresses raw audio waveforms into a hierarchy of discrete codes at different temporal resolutions, which autoregressive Transformers then model as sequences.
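The quantization step at the heart of this mechanism can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `quantize`, the array shapes, and the toy codebook are assumptions made for the example. Each encoder output vector is replaced by the index of its nearest codebook vector, and the resulting integer sequence is the kind of input an autoregressive prior would model.

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Map each latent vector to its nearest codebook entry.

    latents:  (T, D) encoder outputs (hypothetical shapes)
    codebook: (K, D) codebook vectors
    Returns (codes, quantized): codes has shape (T,), quantized has shape (T, D).
    """
    # Squared Euclidean distance between every latent and every code: (T, K)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    codes = np.argmin(dists, axis=1)
    return codes, codebook[codes]

# Toy data: 3 codes in 2 dimensions, 4 time steps.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.6, 0.6]])
codes, quantized = quantize(latents, codebook)
print(codes.tolist())  # [0, 1, 2, 1]
```

In a multi-scale setup, one such quantizer would run at each temporal resolution, so coarser levels emit fewer codes for the same audio.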

If this is right

  • The model can produce songs lasting multiple minutes that maintain overall structure and style without drifting.
  • Conditioning on artist and genre labels steers both instrumental and vocal characteristics in the generated output.
  • Providing unaligned lyrics improves alignment between generated singing and supplied text without requiring timed annotations.
  • The discrete code representation supports sampling diverse variations while preserving high audio fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compression-plus-autoregressive pipeline could be tested on other long-form audio domains such as speech or environmental sound.
  • Because the codes are discrete, simple operations like code swapping might enable basic music editing or style transfer without retraining.
  • If scaling the transformer component continues to improve sample quality, the approach may generalize to longer contexts or more complex musical forms.
  • Releasing weights and code allows direct measurement of how much the multi-scale hierarchy contributes versus the transformer size alone.

Load-bearing premise

The multi-scale VQ-VAE compression step keeps enough perceptual and structural detail from the original audio that autoregressive modeling of the resulting codes can produce musically coherent output over long durations.

What would settle it

Generate 100 samples conditioned only on artist and genre and measure whether more than half lose melodic or rhythmic coherence before the 60-second mark.
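One way to score such an experiment is sketched below. The helper name `premise_fails`, the per-sample "time at which coherence is lost" input, and the toy numbers are hypothetical; how coherence loss is judged (for example by human listeners) is left open here, as it is in the proposal above.

```python
from typing import Optional, Sequence

def premise_fails(breakdown_times: Sequence[Optional[float]],
                  horizon_s: float = 60.0) -> bool:
    """Return True if more than half of the samples lose coherence
    before `horizon_s` seconds.

    breakdown_times[i] is the time in seconds at which sample i was judged
    to lose melodic or rhythmic coherence, or None if it never did.
    """
    early = sum(1 for t in breakdown_times if t is not None and t < horizon_s)
    return early > len(breakdown_times) / 2

# Toy annotations for 100 samples (made-up numbers, not results).
times = [45.0] * 30 + [90.0] * 50 + [None] * 20
print(premise_fails(times))  # 30 of 100 break early, so this prints False
```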

Figures

Figures reproduced from arXiv: 2005.00341 by Alec Radford, Christine Payne, Heewoo Jun, Ilya Sutskever, Jong Wook Kim, Prafulla Dhariwal.

Figure 1: We first train three separate VQ-VAE models with different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors h_t, which are then quantized to the closest codebook vectors e_{z_t}. The code z_t is a discrete representation of the audio that we later train our prior on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level lear…
Figure 2: Sampling methods for generating music
Figure 3: Lyrics-singing alignment learned by one of the encoder-decoder attention layers. The x-axis is the position of music queries, and the y-axis is the position of lyric keys. The positions attended to by the decoder correspond to the characters being sung.
Figure 4: Comparison of reconstructions from different VQ-VAEs, x-axis is time and y-axis is frequency. The columns from left to right are bottom-, middle-, and top-level reconstructions at hop lengths 8, 32, and 128 respectively, visualized as Mel spectrograms. The first row is the ground-truth, and the second row shows the spectrograms of audio outputs from our VQ-VAE. In the third row, we remove the spectral loss…
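To get a feel for these hop lengths, the arithmetic below converts them into token rates and sequence lengths. The 44.1 kHz sample rate is an assumption (it is not stated in the caption), as are the helper names and the four-minute example duration.

```python
SAMPLE_RATE_HZ = 44_100  # assumed sample rate, not given in the caption
HOP_LENGTHS = {"bottom": 8, "middle": 32, "top": 128}

def tokens_per_second(hop: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """One discrete code is emitted for every `hop` audio samples."""
    return sample_rate / hop

def tokens_for_duration(hop: int, seconds: float,
                        sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of whole codes needed to cover `seconds` of audio."""
    return int(seconds * sample_rate) // hop

for level, hop in HOP_LENGTHS.items():
    rate = tokens_per_second(hop)
    n = tokens_for_duration(hop, 240)
    print(f"{level:>6}: hop {hop:>3} -> {rate:8.1f} tokens/s, {n} tokens for 4 min")
```

Under that assumption the top level needs 16 times fewer tokens than the bottom level for the same audio, which illustrates why long-range structure is modeled at the coarsest level.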
Figure 5: Entropy of codebook with 2048 codes, i.e., 11 bits, over training. Reviving dead codes near random encoder outputs ensures good codebook utilization from the start of training. To mitigate codebook collapse, we restart dead codes near random encoder embeddings.
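The two quantities in this caption, codebook entropy and dead-code revival, can be illustrated with a small sketch. The function names, the `min_count` usage threshold, and the toy data are assumptions for illustration; the caption only states that dead codes are restarted near random encoder outputs.

```python
import math
import random
from collections import Counter

def codebook_entropy_bits(codes):
    """Empirical entropy (in bits) of how often each code is used."""
    counts = Counter(codes)
    total = len(codes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def revive_dead_codes(codebook, codes, encoder_outputs, rng, min_count=1):
    """Reset rarely used codebook vectors to randomly chosen encoder outputs.

    `min_count` is a hypothetical threshold below which a code counts as dead.
    Modifies `codebook` in place and returns the indices that were reset.
    """
    counts = Counter(codes)
    revived = []
    for k in range(len(codebook)):
        if counts[k] < min_count:
            codebook[k] = list(rng.choice(encoder_outputs))
            revived.append(k)
    return revived

# With 2048 codes used equally often, entropy reaches its maximum of
# log2(2048) = 11 bits; a collapsed codebook would score far lower.
uniform_entropy = codebook_entropy_bits(list(range(2048)))
collapsed_entropy = codebook_entropy_bits([0] * 2000 + [1] * 48)
print(uniform_entropy, collapsed_entropy)
```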
Figure 6: Axis-aligned attention patterns. For the Adam state tensors (m_t, v_t) we do dynamic scaling. For each iteration and for every parameter, we rescale its state tensors before casting so that their maximum corresponds to the maximum value of the float16 range, thus maximizing the use of the float16 range. Thus, we store the state m_t as the tuple (scale…
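The dynamic scaling of optimizer state described here can be sketched in NumPy as below. This is an illustrative reading of the passage, not the authors' implementation: the function names, the guard for an all-zero tensor, and the choice to keep the scale as a Python float are assumptions.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # largest finite float16 value

def pack_state(state: np.ndarray):
    """Rescale so the largest magnitude maps to the float16 maximum, then cast.

    Returns (scale, packed) where packed is a float16 array.
    """
    peak = float(np.max(np.abs(state)))
    scale = peak / FP16_MAX if peak > 0 else 1.0  # zero guard is an assumption
    return scale, (state / scale).astype(np.float16)

def unpack_state(scale: float, packed: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 state from (scale, float16 tensor)."""
    return packed.astype(np.float32) * scale

# Tiny second-moment values underflow to zero in a plain float16 cast,
# but survive when rescaled first.
v_t = np.array([1e-10, 3e-9, 2e-8], dtype=np.float32)
naive = v_t.astype(np.float16)
scale, packed = pack_state(v_t)
restored = unpack_state(scale, packed)
print(naive, restored)
```

The trade-off in this sketch is that one extra scalar is stored per tensor in exchange for using the full float16 dynamic range.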
Figure 7: Each encoder block consists of a downsampling convolution, a residual network, and a 1D convolution with a kernel size of 3. Dilation is grown by a factor of 3 in these residual networks to increase the receptive field. The decoder block mirrors this exactly with a 1D convolution with the kernel size of 3, a residual network with dilation contracting across depth, and an upsampling transposed convolution.…
Figure 8: Detailed architecture of the music prior and upsampler models
Figure 9: t-SNE of (artist, genre) embedding. The overall clustering shows very clearly how genres are related to one another. The broadest of all, pop, is situated in the middle of rock, country, blues, hip hop, and many more. Soundtrack and classical form their own island. Within a genre, we see a similar trend among artists. John Lennon, Paul McCartney, George Harrison and Ringo Starr are clustered around The Bea…
Original abstract

We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples at https://jukebox.openai.com, along with model weights and code at https://github.com/openai/jukebox

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Jukebox, a generative model for raw-audio music with singing. It compresses audio via a multi-scale VQ-VAE into discrete codes and models the codes with autoregressive Transformers. The central claim is that the scaled model produces high-fidelity, diverse songs that remain coherent for multiple minutes and can be steered by artist/genre labels or unaligned lyrics. The authors release thousands of samples, model weights, and code.

Significance. If the coherence and fidelity claims hold under quantitative scrutiny, the work would represent a meaningful advance in long-context audio generation by showing that hierarchical discrete compression plus large-scale autoregressive modeling can sustain musical structure over minute-scale horizons. The public release of weights, code, and non-cherry-picked samples is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [Abstract / VQ-VAE architecture] Abstract and the description of the multi-scale VQ-VAE: the central claim that the top-level codes support minute-scale coherence requires that long-range musical attributes (phrase repetition, harmonic arcs, form) survive the extreme temporal compression. No mutual-information analysis, ablation of code-level context, or other direct test of information retention across the VQ-VAE hierarchy is reported, leaving the weakest assumption unexamined.
  2. [Abstract] Abstract: all reported results are qualitative (released samples). No objective metrics, baselines, listening-test protocols, or error analysis are supplied to quantify fidelity or coherence, which is load-bearing for the claim that the combined model 'can generate high-fidelity and diverse songs with coherence up to multiple minutes.'
minor comments (1)
  1. [Abstract] The abstract states that samples are 'non cherry-picked,' but the manuscript does not describe the sampling procedure or selection criteria used to produce the released set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. Below we respond point-by-point to the major comments. Our responses focus on the manuscript as submitted and the evidence it provides.

  1. Referee: [Abstract / VQ-VAE architecture] Abstract and the description of the multi-scale VQ-VAE: the central claim that the top-level codes support minute-scale coherence requires that long-range musical attributes (phrase repetition, harmonic arcs, form) survive the extreme temporal compression. No mutual-information analysis, ablation of code-level context, or other direct test of information retention across the VQ-VAE hierarchy is reported, leaving the weakest assumption unexamined.

    Authors: We agree that direct measurements of information retention (e.g., mutual information between raw audio and top-level codes, or ablations of context at each VQ-VAE level) would strengthen the architectural justification. The submitted manuscript does not contain these analyses; the multi-scale VQ-VAE design is motivated by prior hierarchical compression work and is validated indirectly through the coherence observed in the generated samples. We can add a short discussion of this limitation and the design rationale in a revision, but performing the requested quantitative probes would constitute new experiments beyond the scope of the current submission. revision: partial

  2. Referee: [Abstract] Abstract: all reported results are qualitative (released samples). No objective metrics, baselines, listening-test protocols, or error analysis are supplied to quantify fidelity or coherence, which is load-bearing for the claim that the combined model 'can generate high-fidelity and diverse songs with coherence up to multiple minutes.'

    Authors: The manuscript indeed presents results through qualitative inspection of released samples rather than objective metrics or formal listening tests. Standardized quantitative metrics for long-form musical fidelity and coherence remain an open research problem; we therefore prioritized releasing thousands of non-cherry-picked samples, model weights, and code to enable community evaluation. We acknowledge that this leaves the central claim without the quantitative support the referee requests. A revision could expand the abstract and evaluation section to explicitly state the qualitative nature of the evidence and note the absence of listening-test protocols. revision: partial

Circularity Check

0 steps flagged

No circularity: standard empirical training pipeline with no self-referential derivations

The paper describes a multi-scale VQ-VAE for audio compression followed by autoregressive Transformer modeling of the resulting discrete codes, with conditioning on artist/genre/lyrics. No equations, derivations, or claims reduce a result to fitted parameters or self-citations by construction. The central claims rest on empirical training and sampling on external data, with no load-bearing steps that equate outputs to inputs via definition or renaming. The approach is self-contained against external benchmarks (reconstruction fidelity, generation quality) without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.0 · 5662 in / 1255 out tokens · 33296 ms · 2026-05-24T15:16:44.008396+00:00 · methodology

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.

  2. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.

  3. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  4. HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation

    cs.HC 2026-05 unverdicted novelty 7.0

    HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.

  5. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR introduces a phasor-based contrastive framework with learned spectral pooling and complex heads that enforces pitch-equivariant and phase-equivariant biases, delivering up to 70% relative accuracy gains in stem...

  6. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR introduces a contrastive audio representation framework with spectral pooling and complex-valued processing that sets new state-of-the-art results in stem retrieval on MoisesDB, Slakh, and ChocoChorales while a...

  7. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.

  8. ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

    cs.SD 2026-04 unverdicted novelty 7.0

    ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.

  9. Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

    cs.CV 2026-04 unverdicted novelty 7.0

    A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, achieving new state-of-the-art results on HuGaDB, LARa, and BABEL while reducing segment length bias.

  10. Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions

    cs.CV 2026-03 unverdicted novelty 7.0

    An inference-time optimization using a control-energy objective on pretrained diffusion models enables coherent long-range human motion generation with explicit domain transitions.

  11. From Daily Song to Daily Self: Supporting Reflective Songwriting of Deaf and Hard-of-Hearing Individuals through Generative Music AI

    cs.HC 2026-03 unverdicted novelty 7.0

    SoulNote enables multi-session GenAI songwriting for DHH users, producing measurable gains in self-insight, emotion regulation, and self-care attitudes.

  12. MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

    cs.SD 2026-02 unverdicted novelty 7.0

    MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.

  13. Finite Scalar Quantization: VQ-VAE Made Simple

    cs.CV 2023-09 conditional novelty 7.0

    Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an equivalent-sized implicit codebook with competitive performance and no collapse.

  14. High Fidelity Neural Audio Compression

    eess.AS 2022-10 accept novelty 7.0

    EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...

  15. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  16. Diffusion Models Beat GANs on Image Synthesis

    cs.LG 2021-05 accept novelty 7.0

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  17. Scaling Laws for Autoregressive Generative Modeling

    cs.LG 2020-10 accept novelty 7.0

    Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

  18. Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

    cs.LG 2026-05 unverdicted novelty 6.0

    A warm-up phase training VQ-VAEs as autoencoders first avoids dimensional collapse and yields better reconstruction and perceptual quality.

  19. Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

    cs.LG 2026-05 unverdicted novelty 6.0

    An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.

  20. UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

    eess.AS 2026-04 unverdicted novelty 6.0

    UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.

  21. Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

    cs.SD 2026-04 unverdicted novelty 6.0

    Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.

  22. Make it Simple, Make it Dance: Dance Motion Simplification to Support Novices' Dance Learning

    cs.HC 2026-04 unverdicted novelty 6.0

    Rule-based and learning-based algorithms simplify dance motions to help novices learn more effectively while maintaining naturalness and style.

  23. Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP

    cs.SD 2026-04 unverdicted novelty 6.0

    A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.

  24. Two-Dimensional Quantization for Geometry-Aware Audio Coding

    cs.SD 2025-12 unverdicted novelty 6.0

    Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

  25. SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization

    cs.SD 2025-05 unverdicted novelty 6.0

    SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.

  26. Not that Groove: Zero-Shot Symbolic Music Editing

    cs.SD 2025-05 unverdicted novelty 6.0

    The work formalizes zero-shot symbolic drum editing as LLM reasoning over a drumroll grid notation, evaluates it on a new benchmark with automated symbolic unit tests, and reports up to 68% success across eight models.

  27. GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation

    cs.GR 2025-02 unverdicted novelty 6.0

    GCDance is a text-and-music-conditioned diffusion framework that generates genre-consistent 3D dance sequences and reports better results than prior methods on FineDance and AIST++.

  28. GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

    cs.CL 2024-12 conditional novelty 6.0

    GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175bps single-codebook tokenizer from ASR, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens be...

  29. Shap-E: Generating Conditional 3D Implicit Functions

    cs.CV 2023-05 accept novelty 6.0

    Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.

  30. Is Conditional Generative Modeling all you need for Decision-Making?

    cs.LG 2022-11 unverdicted novelty 6.0

    Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.

  31. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  32. No Language Left Behind: Scaling Human-Centered Machine Translation

    cs.CL 2022-07 unverdicted novelty 6.0

    A sparsely gated mixture-of-experts model trained on newly mined low-resource data achieves 44% relative BLEU improvement across 200 languages while adding human safety evaluation.

  33. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  34. VideoGPT: Video Generation using VQ-VAE and Transformers

    cs.CV 2021-04 accept novelty 6.0

    VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.

  35. Scaling Laws for Transfer

    cs.LG 2021-02 unverdicted novelty 6.0

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  36. Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems

    cs.IR 2026-04 unverdicted novelty 5.0

    Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.

  37. Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

    cs.AI 2026-03 unverdicted novelty 5.0

    Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.

  38. Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias

    cs.LG 2026-01 unverdicted novelty 5.0

    Smart Embedding reduces parameters by 48.3 percent in polyphonic music models with information-theoretic loss bounds under 0.153 bits and tighter generalization via Rademacher complexity.

  39. Continuous diffusion for categorical data

    cs.CL 2022-11 unverdicted novelty 5.0

    The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
