Autoregressive Diffusion Models

Alexey A. Gritsenko; Ben Poole; Emiel Hoogeboom; Jasmijn Bastings; Rianne van den Berg; Tim Salimans

arxiv: 2110.02037 · v2 · pith:BSW7MT2Xnew · submitted 2021-10-05 · 💻 cs.LG · stat.ML

Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans

keywords: ardms · diffusion · models · autoregressive · generation · model · compression · data

Abstract

We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective similar to modern probabilistic diffusion models that scales favourably to highly-dimensional data. At test time, ARDMs support parallel generation which can be adapted to fit any given generation budget. We find that ARDMs require significantly fewer steps than discrete diffusion models to attain the same performance. Finally, we apply ARDMs to lossless compression, and show that they are uniquely suited to this task. Contrary to existing approaches based on bits-back coding, ARDMs obtain compelling results not only on complete datasets, but also on compressing single data points. Moreover, this can be done using a modest number of network calls for (de)compression due to the model's adaptable parallel generation.

This paper has not been read by Pith yet.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL · 2025-02 · unverdicted · novelty 8.0
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding
cs.LG · 2026-07 · unverdicted · novelty 7.0
Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models
cs.CL · 2026-06 · unverdicted · novelty 7.0
The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.

Dynamic Chunking for Diffusion Language Models
cs.CL · 2026-05 · unverdicted · novelty 7.0
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG · 2026-05 · unverdicted · novelty 7.0
Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.

Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG · 2026-02 · unverdicted · novelty 7.0
Discrete Stochastic Localization lets a single trained network support an entire family of per-token SNR paths for discrete sequence generation, with masked diffusion as a special case, and improves MAUVE scores when ...

Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG · 2026-05 · unverdicted · novelty 6.0
DSL provides a continuous embedding framework where one denoiser supports a family of SNR paths for discrete sequences, improving MAUVE scores on OpenWebText and allowing random-order and hybrid sampling from a fine-t...

Coupling Models for One-Step Discrete Generation
cs.LG · 2026-05 · unverdicted · novelty 6.0
Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.

Continuous Latent Diffusion Language Model
cs.CL · 2026-05 · unverdicted · novelty 6.0
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
cs.CL · 2025-08 · unverdicted · novelty 6.0
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG · 2025-05 · conditional · novelty 6.0
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation
cs.CV · 2024-11 · unverdicted · novelty 5.0
SOW uses MLLMs and attention to selectively control unidirectional diffusion for pixel-level fidelity and contextual coherence in text-vision-to-image tasks.

Continuous diffusion for categorical data
cs.CL · 2022-11 · unverdicted · novelty 5.0
The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.

A Mathematical Introduction to Diffusion Models
cs.LG · 2026-07 · unverdicted
An educational exposition that layers core definitions, simplified estimates, and research-level theorems on diffusion sampling for probability-background graduate students.