Jukebox: A Generative Model for Music

Alec Radford; Christine Payne; Heewoo Jun; Ilya Sutskever; Jong Wook Kim; Prafulla Dhariwal

arXiv: 2005.00341 · v1 · pith:YGSCZ6HPnew · submitted 2020-04-30 · 📡 eess.AS · cs.LG · cs.SD · stat.ML

Pith reviewed 2026-05-24 15:16 UTC · model grok-4.3

keywords: generative music model · raw audio generation · VQ-VAE · autoregressive transformer · conditioned music generation · singing voice synthesis · multi-scale compression

The pith

Jukebox generates high-fidelity songs with vocals in raw audio by compressing waveforms into discrete codes and modeling them with autoregressive transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generative model that produces music containing singing directly as raw audio waveforms rather than symbolic or MIDI representations. It solves the problem of very long audio sequences by first using a multi-scale vector-quantized variational autoencoder to turn the waveform into a hierarchy of discrete codes, then training large autoregressive transformer models on those codes. The resulting system produces diverse, high-quality songs that remain coherent for multiple minutes and can be steered by conditioning on artist identity, genre labels, and unaligned lyrics. A reader would care because the work shows a concrete path from raw audio data to controllable, minute-scale musical output without intermediate symbolic steps.

Core claim

Jukebox generates music with singing in the raw audio domain by compressing audio with a multi-scale VQ-VAE into discrete codes and modeling those codes with autoregressive Transformers, which at scale produces high-fidelity and diverse songs coherent up to multiple minutes while allowing conditioning on artist, genre, and unaligned lyrics.

What carries the argument

Multi-scale VQ-VAE that compresses raw audio waveforms into a hierarchy of discrete codes at different temporal resolutions, which autoregressive Transformers then model as sequences.
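To make "different temporal resolutions" concrete, here is a small sketch of how much raw audio a single prior context covers at each level. The hop lengths (8, 32, 128) and the 8192-token context are the figures quoted elsewhere on this page; the 44.1 kHz sample rate and the helper name `context_span` are assumptions made for illustration.

```python
# Sketch: audio span covered by one context window at each VQ-VAE level.
# HOP_LENGTHS and CONTEXT_TOKENS are quoted on this page;
# SAMPLE_RATE_HZ is an assumption made for illustration.
HOP_LENGTHS = {"bottom": 8, "middle": 32, "top": 128}
CONTEXT_TOKENS = 8192
SAMPLE_RATE_HZ = 44_100  # assumed


def context_span(hop_length, context_tokens=CONTEXT_TOKENS, sample_rate=SAMPLE_RATE_HZ):
    """Return (samples, seconds) of raw audio covered by one context."""
    samples = hop_length * context_tokens
    return samples, samples / sample_rate


for level, hop in HOP_LENGTHS.items():
    samples, seconds = context_span(hop)
    print(f"{level:>6}: {samples:>9} samples, about {seconds:.1f} s")
```

The bottom and middle values, 65536 and 262144 samples, match the upsampler sample lengths listed in the hyperparameter excerpt further down this page. Under the assumed sample rate, one top-level context covers well under a minute, so a multi-minute sample would span several context windows.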

If this is right

The model can produce songs lasting multiple minutes that maintain overall structure and style without drifting.
Conditioning on artist and genre labels steers both instrumental and vocal characteristics in the generated output.
Providing unaligned lyrics improves alignment between generated singing and supplied text without requiring timed annotations.
The discrete code representation supports sampling diverse variations while preserving high audio fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compression-plus-autoregressive pipeline could be tested on other long-form audio domains such as speech or environmental sound.
Because the codes are discrete, simple operations like code swapping might enable basic music editing or style transfer without retraining.
If scaling the transformer component continues to improve sample quality, the approach may generalize to longer contexts or more complex musical forms.
Releasing weights and code allows direct measurement of how much the multi-scale hierarchy contributes versus the transformer size alone.

Load-bearing premise

The multi-scale VQ-VAE compression step keeps enough perceptual and structural detail from the original audio that autoregressive modeling of the resulting codes can produce musically coherent output over long durations.

What would settle it

Generate 100 samples conditioned only on artist and genre and measure whether more than half lose melodic or rhythmic coherence before the 60-second mark.
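The decision rule in this test can be written down directly. The sketch below assumes each sample has already been rated by a listener, with the time at which coherence was judged lost (or `None` if it never was); the function name and the rating format are hypothetical, not something the paper defines.

```python
def settles_against(coherence_loss_times_s, horizon_s=60.0):
    """True if more than half the samples lose coherence before horizon_s.

    coherence_loss_times_s holds one entry per sample: the time in seconds
    at which a rater judged coherence lost, or None if it never was.
    """
    if not coherence_loss_times_s:
        raise ValueError("need at least one rated sample")
    failures = sum(
        1 for t in coherence_loss_times_s if t is not None and t < horizon_s
    )
    return failures > len(coherence_loss_times_s) / 2
```

The hard part of the test is the rating itself, which this sketch leaves to human judgment.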

Figures

Figures reproduced from arXiv: 2005.00341 by Alec Radford, Christine Payne, Heewoo Jun, Ilya Sutskever, Jong Wook Kim, Prafulla Dhariwal.

**Figure 1.** We first train three separate VQ-VAE models with different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors h_t, which are then quantized to the closest codebook vectors e_{z_t}. The code z_t is a discrete representation of the audio that we later train our prior on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level lear…
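The quantization step in this caption, mapping each latent vector to its closest codebook vector, can be sketched in a few lines. This is a minimal illustration using squared Euclidean distance on toy two-dimensional vectors; the function name `quantize` and all numeric values are hypothetical and are not taken from the released code.

```python
def quantize(latents, codebook):
    """Map each latent vector h_t to the index z_t of its nearest codebook vector."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    codes = [
        min(range(len(codebook)), key=lambda k: sq_dist(h, codebook[k]))
        for h in latents
    ]
    quantized = [codebook[z] for z in codes]  # the e_{z_t} passed to the decoder
    return codes, quantized


# Toy values for illustration only.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.3), (-0.8, 1.7)]
codes, quantized = quantize(latents, codebook)
```

The integer sequence `codes` is what a prior would be trained on, while `quantized` is what the decoder would consume.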

**Figure 2.** Sampling methods for generating music

**Figure 3.** Lyrics-singing alignment learned by one of the encoder-decoder attention layers. The x-axis is the position of music queries, and the y-axis is the position of lyric keys. The positions attended to by the decoder correspond to the characters being sung.

… previously generated tokens into the model as inputs and outputting the next token conditioned on all previous tokens. We then run our conditioning wavenet…

**Figure 4.** Comparison of reconstructions from different VQ-VAEs, x-axis is time and y-axis is frequency. The columns from left to right are bottom-, middle-, and top-level reconstructions at hop lengths 8, 32, and 128 respectively, visualized as Mel spectrograms. The first row is the ground-truth, and the second row shows the spectrograms of audio outputs from our VQ-VAE. In the third row, we remove the spectral loss…

**Figure 5.** Entropy of codebook with 2048 codes, i.e. 11 bits, over training. Reviving dead codes near random encoder outputs ensures good codebook utilization from the start of training.

… larger hop sizes. To mitigate codebook collapse, we restart dead codes near random encoder embeddings. In…
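The two ideas in this caption, measuring codebook entropy and reviving dead codes, can be sketched as follows. This is a simplified illustration: a code counts as "dead" here if no input in the batch selected it, and it is reset to exactly a random encoder output rather than to a point near one. Both choices, and the function names, are assumptions rather than the authors' procedure.

```python
import math
import random
from collections import Counter


def codebook_entropy_bits(codes):
    """Empirical entropy (bits) of the code indices used in a batch."""
    counts = Counter(codes)
    total = len(codes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def revive_dead_codes(codebook, codes, encoder_outputs, rng):
    """Reset codebook vectors that no input selected to random encoder outputs.

    Mutates codebook in place and returns the indices that were reset.
    """
    used = set(codes)
    revived = []
    for k in range(len(codebook)):
        if k not in used:
            codebook[k] = list(rng.choice(encoder_outputs))
            revived.append(k)
    return revived
```

With 2048 codes used uniformly, `codebook_entropy_bits` reaches its maximum of log2(2048) = 11 bits, which is the ceiling the caption refers to.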

**Figure 6.** Axis-aligned attention patterns

…sion completely without loss in training performance. For the Adam state tensors (m_t, v_t) we do dynamic scaling. For each iteration and for every parameter, we rescale its state tensors before casting so that their maximum corresponds to the maximum value of the float16 range, thus maximizing the use of the float16 range. Thus, we store the state m_t as the tuple (scale…
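The dynamic scaling described in the text fragment above can be illustrated with a small pure-Python sketch. It stores a state tensor as a (scale, float16 values) pair, choosing the scale so the largest magnitude lands on the float16 maximum. The function names are hypothetical, and the per-tensor list representation is an assumption for illustration; the source text is truncated before it finishes describing the stored tuple.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_float16(x):
    """Round a Python float to the nearest float16 and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def compress_state(values):
    """Return (scale, float16-rounded values), largest magnitude mapped to FLOAT16_MAX."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / FLOAT16_MAX if max_abs > 0 else 1.0
    return scale, [to_float16(v / scale) for v in values]


def decompress_state(scale, stored):
    """Recover approximate full-precision values from the stored pair."""
    return [s * scale for s in stored]


# Toy Adam-like state values spanning several orders of magnitude.
state = [1e-3, 1e-8, -5e-6]
scale, stored = compress_state(state)
restored = decompress_state(scale, stored)
```

In this toy example a direct cast of `1e-8` to float16 underflows to zero, whereas the rescaled value survives the round trip with small relative error.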

**Figure 7.** Each encoder block consists of a downsampling convolution, a residual network, and a 1D convolution with a kernel size of 3. Dilation is grown by a factor of 3 in these residual networks to increase the receptive field. The decoder block mirrors this exactly with a 1D convolution with the kernel size of 3, a residual network with dilation contracting across depth, and an upsampling transposed convolution.…
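Two bits of arithmetic sit behind this caption: how stacked downsampling blocks produce a hop length, and how growing dilation widens the receptive field. The sketch below is a simplification. The stride of 2 per block is inferred from the statement elsewhere on this page that seven blocks yield a hop length of 128, and the receptive-field formula counts only a sequence of stride-1 dilated convolutions with kernel size 3, so it is not expected to reproduce the paper's millisecond figures.

```python
def hop_length(num_blocks, stride=2):
    """Total downsampling factor after stacking num_blocks blocks (stride is inferred)."""
    return stride ** num_blocks


def dilations(num_layers, growth=3):
    """Dilation of each layer when it grows by a constant factor per layer."""
    return [growth ** i for i in range(num_layers)]


def receptive_field(dilation_list, kernel_size=3):
    """Receptive field, in input steps, of stride-1 dilated convolutions applied in sequence."""
    return 1 + sum((kernel_size - 1) * d for d in dilation_list)


print([hop_length(n) for n in (3, 5, 7)])
print(dilations(4), receptive_field(dilations(4)))
```

Under these assumptions, 3, 5, and 7 blocks give the hop lengths 8, 32, and 128, and four layers with dilations 1, 3, 9, 27 see 81 input steps, compared with 9 steps for four undilated layers.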

**Figure 8.** Detailed architecture of the music prior and upsampler models

**Figure 9.** t-SNE of (artist, genre) embedding. The overall clustering shows very clearly how genres are related to one another. The broadest of all, pop, is situated in the middle of rock, country, blues, hip hop, and many more. Soundtrack and classical form their own island. Within a genre, we see a similar trend among artists. John Lennon, Paul McCartney, George Harrison and Ringo Starr are clustered around The Bea…

Original abstract

We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples at https://jukebox.openai.com, along with model weights and code at https://github.com/openai/jukebox

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Jukebox shows a working pipeline for multi-minute raw-audio songs with vocals by stacking multi-scale VQ-VAE compression and large autoregressive Transformers, and the authors released the code and samples.

The main takeaway is that the system produces listenable multi-minute tracks with singing by first compressing raw audio into discrete codes at several temporal scales with a VQ-VAE, then modeling those codes autoregressively with Transformers while conditioning on artist, genre, and unaligned lyrics. This specific integration at the reported scale for controllable singing is the concrete advance over earlier separate uses of VQ-VAE or Transformers on audio.

Releasing thousands of samples, the model weights, and the code is the clearest practical contribution; anyone can check the outputs directly instead of relying on claims alone. The engineering required to train these models on long audio sequences is non-trivial and the public artifacts make the work usable right away.

The soft spot is evaluation. The abstract and stress-test note both point to the same gap: no quantitative metrics, baselines, or direct tests of whether the top-level codes retain phrase-level or harmonic structure across minutes. The multi-scale VQ-VAE is optimized for local reconstruction, so it is reasonable to ask how much global musical information survives the extreme compression before the Transformer sees it. Without an ablation on code context length or a probe for long-horizon attributes, the coherence in the samples could partly reflect local continuity or selection rather than learned long-range modeling.

This paper is aimed at people building generative audio systems or scaling sequence models to long contexts. A reader who needs to see how conditioning and multi-scale discretization are handled in practice will find concrete choices to examine. It deserves peer review because the scale and the released artifacts give the community something concrete to build on, even if the quantitative backing needs strengthening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Jukebox, a generative model for raw-audio music with singing. It compresses audio via a multi-scale VQ-VAE into discrete codes and models the codes with autoregressive Transformers. The central claim is that the scaled model produces high-fidelity, diverse songs that remain coherent for multiple minutes and can be steered by artist/genre labels or unaligned lyrics. The authors release thousands of samples, model weights, and code.

Significance. If the coherence and fidelity claims hold under quantitative scrutiny, the work would represent a meaningful advance in long-context audio generation by showing that hierarchical discrete compression plus large-scale autoregressive modeling can sustain musical structure over minute-scale horizons. The public release of weights, code, and non-cherry-picked samples is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[Abstract / VQ-VAE architecture] Abstract and the description of the multi-scale VQ-VAE: the central claim that the top-level codes support minute-scale coherence requires that long-range musical attributes (phrase repetition, harmonic arcs, form) survive the extreme temporal compression. No mutual-information analysis, ablation of code-level context, or other direct test of information retention across the VQ-VAE hierarchy is reported, leaving the weakest assumption unexamined.
[Abstract] Abstract: all reported results are qualitative (released samples). No objective metrics, baselines, listening-test protocols, or error analysis are supplied to quantify fidelity or coherence, which is load-bearing for the claim that the combined model 'can generate high-fidelity and diverse songs with coherence up to multiple minutes.'
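The mutual-information analysis requested in the first major comment could start from something as simple as a plug-in estimate between top-level codes and a discrete musical annotation. The sketch below assumes paired observations are available, for example a code index and a hypothetical section label per time step; neither the labels nor the function name come from the paper.

```python
import math
from collections import Counter


def mutual_information_bits(codes, labels):
    """Plug-in estimate of I(codes; labels) in bits from paired discrete observations."""
    if len(codes) != len(labels) or not codes:
        raise ValueError("need equally sized, non-empty sequences")
    n = len(codes)
    joint = Counter(zip(codes, labels))
    code_counts = Counter(codes)
    label_counts = Counter(labels)
    return sum(
        (c / n) * math.log2(c * n / (code_counts[z] * label_counts[y]))
        for (z, y), c in joint.items()
    )
```

The plug-in estimate is biased upward when the number of samples is small relative to the number of distinct (code, label) pairs, so with a 2048-entry codebook it would need a large sample or a bias correction before being read as evidence either way.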

minor comments (1)

[Abstract] The abstract states that samples are 'non cherry-picked,' but the manuscript does not describe the sampling procedure or selection criteria used to produce the released set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. Below we respond point-by-point to the major comments. Our responses focus on the manuscript as submitted and the evidence it provides.

Referee: [Abstract / VQ-VAE architecture] Abstract and the description of the multi-scale VQ-VAE: the central claim that the top-level codes support minute-scale coherence requires that long-range musical attributes (phrase repetition, harmonic arcs, form) survive the extreme temporal compression. No mutual-information analysis, ablation of code-level context, or other direct test of information retention across the VQ-VAE hierarchy is reported, leaving the weakest assumption unexamined.

Authors: We agree that direct measurements of information retention (e.g., mutual information between raw audio and top-level codes, or ablations of context at each VQ-VAE level) would strengthen the architectural justification. The submitted manuscript does not contain these analyses; the multi-scale VQ-VAE design is motivated by prior hierarchical compression work and is validated indirectly through the coherence observed in the generated samples. We can add a short discussion of this limitation and the design rationale in a revision, but performing the requested quantitative probes would constitute new experiments beyond the scope of the current submission.
revision: partial

Referee: [Abstract] Abstract: all reported results are qualitative (released samples). No objective metrics, baselines, listening-test protocols, or error analysis are supplied to quantify fidelity or coherence, which is load-bearing for the claim that the combined model 'can generate high-fidelity and diverse songs with coherence up to multiple minutes.'

Authors: The manuscript indeed presents results through qualitative inspection of released samples rather than objective metrics or formal listening tests. Standardized quantitative metrics for long-form musical fidelity and coherence remain an open research problem; we therefore prioritized releasing thousands of non-cherry-picked samples, model weights, and code to enable community evaluation. We acknowledge that this leaves the central claim without the quantitative support the referee requests. A revision could expand the abstract and evaluation section to explicitly state the qualitative nature of the evidence and note the absence of listening-test protocols.
revision: partial

Circularity Check

0 steps flagged

No circularity: standard empirical training pipeline with no self-referential derivations

The paper describes a multi-scale VQ-VAE for audio compression followed by autoregressive Transformer modeling of the resulting discrete codes, with conditioning on artist/genre/lyrics. No equations, derivations, or claims reduce a result to fitted parameters or self-citations by construction. The central claims rest on empirical training and sampling on external data, with no load-bearing steps that equate outputs to inputs via definition or renaming. The approach is self-contained against external benchmarks (reconstruction fidelity, generation quality) without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.0 · 5662 in / 1255 out tokens · 33296 ms · 2026-05-24T15:16:44.008396+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We use an autoregressive Sparse Transformer... context of 8192 tokens... hop lengths 8, 32, 128"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ENSEMBITS: an alphabet of protein conformational ensembles
cs.LG 2026-05 unverdicted novelty 8.0

Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.
ENSEMBITS: an alphabet of protein conformational ensembles
cs.LG 2026-05 unverdicted novelty 8.0

Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
MusicLM: Generating Music From Text
cs.SD 2023-01 conditional novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
cs.HC 2026-05 unverdicted novelty 7.0

HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
PHALAR: Phasors for Learned Musical Audio Representations
cs.SD 2026-05 unverdicted novelty 7.0

PHALAR introduces a phasor-based contrastive framework with learned spectral pooling and complex heads that enforces pitch-equivariant and phase-equivariant biases, delivering up to 70% relative accuracy gains in stem...
PHALAR: Phasors for Learned Musical Audio Representations
cs.SD 2026-05 unverdicted novelty 7.0

PHALAR introduces a contrastive audio representation framework with spectral pooling and complex-valued processing that sets new state-of-the-art results in stem retrieval on MoisesDB, Slakh, and ChocoChorales while a...
PHALAR: Phasors for Learned Musical Audio Representations
cs.SD 2026-05 unverdicted novelty 7.0

PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.
ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
cs.SD 2026-04 unverdicted novelty 7.0

ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.
Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization
cs.CV 2026-04 unverdicted novelty 7.0

A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, achieving new state-of-the-art results on HuGaDB, LARa, and BABEL while reducing segment length bias.
Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions
cs.CV 2026-03 unverdicted novelty 7.0

An inference-time optimization using a control-energy objective on pretrained diffusion models enables coherent long-range human motion generation with explicit domain transitions.
From Daily Song to Daily Self: Supporting Reflective Songwriting of Deaf and Hard-of-Hearing Individuals through Generative Music AI
cs.HC 2026-03 unverdicted novelty 7.0

SoulNote enables multi-session GenAI songwriting for DHH users, producing measurable gains in self-insight, emotion regulation, and self-care attitudes.
MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
cs.SD 2026-02 unverdicted novelty 7.0

MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
Finite Scalar Quantization: VQ-VAE Made Simple
cs.CV 2023-09 conditional novelty 7.0

Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an equivalent-sized implicit codebook with competitive performance and no collapse.
High Fidelity Neural Audio Compression
eess.AS 2022-10 accept novelty 7.0

EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Diffusion Models Beat GANs on Image Synthesis
cs.LG 2021-05 accept novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
Scaling Laws for Autoregressive Generative Modeling
cs.LG 2020-10 accept novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
cs.LG 2026-05 unverdicted novelty 6.0

A warm-up phase training VQ-VAEs as autoencoders first avoids dimensional collapse and yields better reconstruction and perceptual quality.
Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
cs.LG 2026-05 unverdicted novelty 6.0

An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
eess.AS 2026-04 unverdicted novelty 6.0

UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.
Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints
cs.SD 2026-04 unverdicted novelty 6.0

Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.
Make it Simple, Make it Dance: Dance Motion Simplification to Support Novices' Dance Learning
cs.HC 2026-04 unverdicted novelty 6.0

Rule-based and learning-based algorithms simplify dance motions to help novices learn more effectively while maintaining naturalness and style.
Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP
cs.SD 2026-04 unverdicted novelty 6.0

A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.
Two-Dimensional Quantization for Geometry-Aware Audio Coding
cs.SD 2025-12 unverdicted novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization
cs.SD 2025-05 unverdicted novelty 6.0

SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.
Not that Groove: Zero-Shot Symbolic Music Editing
cs.SD 2025-05 unverdicted novelty 6.0

The work formalizes zero-shot symbolic drum editing as LLM reasoning over a drumroll grid notation, evaluates it on a new benchmark with automated symbolic unit tests, and reports up to 68% success across eight models.
GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation
cs.GR 2025-02 unverdicted novelty 6.0

GCDance is a text-and-music-conditioned diffusion framework that generates genre-consistent 3D dance sequences and reports better results than prior methods on FineDance and AIST++.
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
cs.CL 2024-12 conditional novelty 6.0

GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175bps single-codebook tokenizer from ASR, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens be...
Shap-E: Generating Conditional 3D Implicit Functions
cs.CV 2023-05 accept novelty 6.0

Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
Is Conditional Generative Modeling all you need for Decision-Making?
cs.LG 2022-11 unverdicted novelty 6.0

Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
No Language Left Behind: Scaling Human-Centered Machine Translation
cs.CL 2022-07 unverdicted novelty 6.0

A sparsely gated mixture-of-experts model trained on newly mined low-resource data achieves 44% relative BLEU improvement across 200 languages while adding human safety evaluation.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
VideoGPT: Video Generation using VQ-VAE and Transformers
cs.CV 2021-04 accept novelty 6.0

VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems
cs.IR 2026-04 unverdicted novelty 5.0

Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
cs.AI 2026-03 unverdicted novelty 5.0

Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias
cs.LG 2026-01 unverdicted novelty 5.0

Smart Embedding reduces parameters by 48.3 percent in polyphonic music models with information-theoretic loss bounds under 0.153 bits and tighter generalization via Rademacher complexity.
Continuous diffusion for categorical data
cs.CL 2022-11 unverdicted novelty 5.0

The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 35 Pith papers · 5 internal anchors

[1]

Layer Normalization

Arık, S. Ö., Chen, J., Peng, K., Ping, W., and Zhou, Y . Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems , pp. 10019– 10029. 2018a. Arık, S. Ö., Jun, H., and Diamos, G. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018b. Ba, J. L., Kiro...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Dota 2 with Large Scale Deep Reinforcement Learning

Berner, C., Brockman, G., Chan, B., Cheung, V ., D˛ ebiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680,

work page internal anchor Pith review Pith/arXiv arXiv 1912
[3]

Generating Long Sequences with Sparse Transformers

Child, R., Gray, S., Radford, A., and Sutskever, I. Gen- erating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[4]

Hierarchi- cal autoregressive image models with auxiliary decoders

De Fauw, J., Dieleman, S., and Simonyan, K. Hierarchi- cal autoregressive image models with auxiliary decoders. arXiv preprint arXiv:1903.04933,

work page arXiv 1903
[5]

NIPS 2016 tutorial: Generative adversarial networks

Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. In Neural Information Processing Systems , Tutorial,

work page 2016
[6]

Spleeter: A fast and state-of-the art music source separa- tion tool with pre-trained models

Hennequin, R., Khlif, A., V oituret, F., and Moussallam, M. Spleeter: A fast and state-of-the art music source separa- tion tool with pre-trained models. Late-Breaking/Demo ISMIR 2019, November

work page 2019
[7]

Axial attention in multidimensional transformers

Ho, J., Kalchbrenner, N., Weissenborn, D., and Salimans, T. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180,

work page arXiv 1912
[8]

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Ping, W., Peng, K., Gibiansky, A., Arik, S

URL https: //openai.com/blog/musenet. Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep Voice 3: 2000-speaker neural text-to-speech. In International Conference on Learning Representations,

work page 2000
[10]

MelNet: A Generative Model for Audio in the Frequency Domain

Vasquez, S. and Lewis, M. MelNet: A generative model for audio in the frequency domain. arXiv preprint arXiv:1906.01083,

work page internal anchor Pith review Pith/arXiv arXiv 1906
[11]

uses recompute with gradient checkpointing, performs computations using half precision activations and gradients, and uses dynamic loss scaling. While this speeds up training on Volta cores, one still has a high memory usage from storing the parameters and Adam state in full float precision. To scale our models further, we store our matmul parameters ...

[12]

To get higher compression in time, we simply stack more of these blocks. For example, using seven blocks yields a hop length of 128 for the top level autoencoder. Each residual network has four residual blocks in the middle and top VQ-VAEs resulting in a receptive field of 120 ms and 480 ms for the respective discrete tokens. Because increasing the resi...

[13]

Detailed architecture of the music prior and upsampler models

B.3. Hyperparameters

For all Transformers’ residual blocks, we use MLP blocks with the same width as the model width, and attention blocks with queries, keys, and values with width 0.25 times the model width. For all convolutional residual blocks, we use co...

[14]

VQ-VAE hyperparameters

1B upsamplers
Sample length: 262144, 65536
Context length: 8192
Transformer width: 1920
Transformer layers: 72
Attention heads: 1
Factorized attention shape: (128,

