Neural Discrete Representation Learning

Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu

classification cs.LG

keywords discrete, learning, model, representations, autoregressive, high, learnt, powerful

Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
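
The vector quantisation step mentioned in the abstract can be illustrated with a minimal sketch: each continuous encoder output is replaced by the index of its nearest codebook vector, which gives the discrete code. The function names, the toy codebook, and the toy encoder outputs below are hypothetical and chosen only for illustration. The sketch covers the nearest-neighbour lookup alone and omits training, the decoder, and the learnt prior.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(
    encoder_outputs: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each encoder output to its nearest codebook entry.

    Returns the discrete code indices and the corresponding
    quantised vectors (copies of the selected codebook rows).
    """
    indices: List[int] = []
    quantised: List[List[float]] = []
    for z in encoder_outputs:
        # Index of the codebook vector closest to z.
        k = min(range(len(codebook)),
                key=lambda j: squared_distance(z, codebook[j]))
        indices.append(k)
        quantised.append(list(codebook[k]))
    return indices, quantised


# Toy data (assumed for illustration): 3 codebook vectors in 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_outputs = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]

codes, z_q = quantise(encoder_outputs, codebook)
print(codes)  # [1, 0, 2]
print(z_q)    # [[1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
```

In this sketch the list `codes` plays the role of the discrete latent representation; a prior over such index sequences is what an autoregressive model would then be fitted to.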

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ENSEMBITS: an alphabet of protein conformational ensembles
cs.LG 2026-05 unverdicted novelty 8.0

Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.
ENSEMBITS: an alphabet of protein conformational ensembles
cs.LG 2026-05 unverdicted novelty 8.0

Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider
hep-ph 2026-04 unverdicted novelty 7.0

The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.
Neuro-Symbolic ODE Discovery with Latent Grammar Flow
cs.LG 2026-04 unverdicted novelty 7.0

Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by d...
Hierarchical Text-Conditional Image Generation with CLIP Latents
cs.CV 2022-04 accept novelty 7.0

A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
cs.CV 2021-12 accept novelty 7.0

A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.
Diffusion Models Beat GANs on Image Synthesis
cs.LG 2021-05 accept novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
Scaling Laws for Autoregressive Generative Modeling
cs.LG 2020-10 accept novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
Network-Efficient World Model Token Streaming
cs.RO 2026-05 unverdicted novelty 6.0

An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
cs.CV 2026-05 unverdicted novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering
cs.DL 2026-03 accept novelty 6.0

ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.
FAST: Efficient Action Tokenization for Vision-Language-Action Models
cs.RO 2025-01 unverdicted novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
TD-MPC2: Scalable, Robust World Models for Continuous Control
cs.LG 2023-10 conditional novelty 6.0

TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
Latent Video Diffusion Models for High-Fidelity Long Video Generation
cs.CV 2022-11 unverdicted novelty 6.0

Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows
cs.CV 2026-05 unverdicted novelty 5.0

PixelFlowCast delivers high-fidelity precipitation nowcasts from radar sequences using a latent-free Pixel Mean Flows predictor guided by a deterministic coarse stage and KANCondNet features.
SID-Coord: Coordinating Semantic IDs for ID-based Ranking in Short-Video Search
cs.IR 2026-04 unverdicted novelty 5.0

SID-Coord coordinates semantic IDs with hashed item IDs via attention fusion, adaptive gating, and interest alignment, yielding +0.664% long-play rate and +0.369% playback duration gains in production search ranking.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
cs.RO 2026-04 accept novelty 5.0

A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.