Continuous diffusion for categorical data

Arman Roshannai; Arnaud Doucet; Chris Dyer; Conor Durkan; Curtis Hawthorne; Jonas Adler; Laurent Sartran; Nikolay Savinov; Pierre H. Richemond; R\'emi Leblond

REVIEW 1 major objections 1 minor 43 cited by

Categorical data such as language can be generated with diffusion models by first embedding tokens into a continuous vector space.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-18 03:25 UTC pith:SW3KZRUK

load-bearing objection CDCD runs standard Gaussian diffusion on learned token embeddings then decodes back to categories, which produces usable language samples but leaves the discretization error unanalyzed. the 1 major comments →

arxiv 2211.15089 v3 pith:SW3KZRUK submitted 2022-11-28 cs.CL cs.LG

Continuous diffusion for categorical data

Sander Dieleman , Laurent Sartran , Arman Roshannai , Nikolay Savinov , Yaroslav Ganin , Pierre H. Richemond , Arnaud Doucet , Robin Strudel

show 6 more authors

Chris Dyer Conor Durkan Curtis Hawthorne R\'emi Leblond Will Grathwohl Jonas Adler

This is my paper

classification cs.CL cs.LG

keywords diffusion modelscategorical datalanguage modelingcontinuous diffusiongenerative modelsembeddingsCDCD

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to keep the core strengths of diffusion models—iterative refinement through a continuous process—when moving from perceptual signals like images to discrete data like text. Instead of inventing new discrete diffusion steps, it embeds each categorical token into a vector and then runs ordinary Gaussian diffusion over continuous time in that space. A sympathetic reader would care because this keeps the same smooth denoising trajectory and training dynamics that made diffusion models dominant for continuous domains, while still producing valid categorical outputs at the end. If the approach holds, generative modeling no longer needs to split into separate continuous and discrete families.

Core claim

We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. Discrete tokens are first mapped to continuous vectors; standard Gaussian diffusion is then performed in this embedding space, allowing the same continuous-time refinement procedure used for images and audio to operate on language and other categorical sequences. The resulting models are shown to perform effectively on several language modelling tasks.

What carries the argument

The CDCD framework, which maps categorical tokens to continuous vectors and applies Gaussian diffusion continuously in both time and that vector space.

Load-bearing premise

Embedding discrete categorical tokens into a continuous vector space and running standard Gaussian diffusion there preserves the benefits of continuous diffusion without introducing new modeling errors that dominate performance.

What would settle it

Train a CDCD model and a strong discrete-diffusion baseline on the same language dataset and measure generation quality or perplexity; if the continuous-embedding version shows no advantage or clear degradation traceable to the embedding step, the central claim is falsified.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

Language modeling tasks can be solved by continuous-time iterative refinement rather than discrete token transitions.
Standard Gaussian diffusion machinery transfers directly once tokens are embedded, avoiding the need for custom discrete noise schedules.
The same framework applies across multiple language modeling benchmarks without changing the underlying diffusion process.
Generation remains a gradual denoising trajectory that can be stopped at any continuous time point.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may allow a single diffusion backbone to handle mixed continuous and categorical inputs in multimodal settings.
Continuous-time sampling could be tuned more finely than discrete-step schedules, potentially improving speed-quality trade-offs in text generation.
Similar embedding-plus-diffusion pipelines might extend naturally to other categorical sequences such as source code or biological strings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

CDCD runs standard Gaussian diffusion on learned token embeddings then decodes back to categories, which produces usable language samples but leaves the discretization error unanalyzed.

read the letter

The core move is to map categorical tokens to continuous vectors, run ordinary diffusion there, and map back at the end. That keeps the sampling continuous in both time and space, which is the main technical claim. The experiments show the approach can train and sample on language tasks without collapsing, and the results are at least competitive with some discrete diffusion baselines on the reported metrics. Credit for actually shipping working code and numbers instead of just the abstract idea. The adaptation of the loss and the embedding training look like standard diffusion machinery with a straightforward categorical wrapper, which is fine as far as it goes. The paper is clear about what it is trying to do and does not overclaim the gains. The soft spot is exactly the one the stress-test flagged: there is no bound or measurement of how much error the final discretization step introduces, nor any controlled test showing that the continuous diffusion itself is doing more work than the choice of embedding space. If the embeddings already separate the categories well, the diffusion may be adding little beyond what a simpler autoregressive or discrete model could do. The results could be driven more by embedding quality than by the diffusion process, and without that separation the continuous benefit is hard to credit. This is the kind of paper that belongs in a reading group for people who follow diffusion extensions to discrete domains. It is worth a serious referee because the implementation is reproducible and the central engineering question is well-posed, even if the current evidence does not yet prove the continuous advantage is load-bearing. I would send it out for review with a request for tighter analysis of the embedding-to-token mapping error.

Referee Report

1 major / 1 minor

Summary. The paper proposes CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space by embedding discrete tokens into a continuous vector space and applying Gaussian diffusion, and demonstrates its efficacy on several language modelling tasks.

Significance. If the central assumption holds—that Gaussian diffusion on learned token embeddings can recover categorical structure upon decoding without dominant approximation error—this work could significantly advance generative modeling for discrete data by preserving the benefits of continuous diffusion models, such as iterative refinement, for applications like language modeling.

major comments (1)

[§3.1] The description of the embedding and diffusion process lacks a formal analysis or bound on the discretization error in the final decoding step (e.g., nearest neighbor or softmax), which is load-bearing for the claim that the model remains effectively continuous in input space without the embedding choice dominating performance.

minor comments (1)

The abstract could more explicitly state the specific language modeling tasks used for demonstration.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the major comment below and indicate the changes we will make to the manuscript.

read point-by-point responses

Referee: [§3.1] The description of the embedding and diffusion process lacks a formal analysis or bound on the discretization error in the final decoding step (e.g., nearest neighbor or softmax), which is load-bearing for the claim that the model remains effectively continuous in input space without the embedding choice dominating performance.

Authors: We agree that a formal bound on discretization error would strengthen the presentation. The manuscript currently relies on the fact that the diffusion process itself operates entirely in continuous space and time, with the embedding learned jointly so that the decoder (nearest-neighbor or softmax) recovers coherent categorical sequences, as demonstrated by the language-modeling results. Deriving a tight, non-vacuous bound is non-trivial because the embedding is data-dependent and optimized end-to-end; any such bound would necessarily involve assumptions on the Lipschitz constant of the embedding map and the concentration of the learned representations. In the revised version we will add a short subsection in §3.1 that (i) explicitly states the decoding step as a post-processing operation that does not alter the continuous nature of the forward and reverse processes, (ii) reports empirical reconstruction error (continuous vector to nearest token) on held-out embeddings as a function of embedding dimension, and (iii) discusses why the observed error does not dominate the generative performance relative to purely discrete baselines. We believe this addresses the concern without requiring an intractable theoretical bound. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain self-contained with no reductions to fitted inputs or self-citations

full rationale

The provided abstract and context describe a proposed framework CDCD that embeds categorical tokens into continuous space for Gaussian diffusion, with no equations, fitting procedures, or self-citations presented that would allow any claimed result to reduce to its inputs by construction. The central claim of preserving continuous diffusion benefits for discrete data is stated as a modeling choice and empirical demonstration rather than a derived prediction forced by prior definitions or author-specific uniqueness theorems. No load-bearing steps match the enumerated circularity patterns, as the text contains no explicit parameter fitting renamed as prediction, ansatz smuggling, or renaming of known results. This is the expected honest non-finding for a high-level proposal without detailed derivation visible in the given material.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central claim rests on the unstated assumption that a continuous embedding of discrete tokens can be denoised with standard diffusion dynamics and then mapped back without loss of the original categorical structure.

pith-pipeline@v0.9.0 · 5663 in / 1036 out tokens · 18730 ms · 2026-05-18T03:25:47.881749+00:00 · methodology

0 comments

read the original abstract

Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.

discussion (0)

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding
cs.LG 2026-07 unverdicted novelty 7.0

Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.
Self-conditioned Flow Map Language Models via Fixed-point Flows
cs.CL 2026-07 unverdicted novelty 7.0

Self-conditioned flow language models solve fixed-point iterations, enabling fixed-point flow maps that distill into FMLM* which outperforms SOTA in few-step generation on OpenWebText.
Low Perplexity is Repetition: A One-Dimensional Self-Conditioning Attractor in Continuous Diffusion LMs
cs.CL 2026-07 unverdicted novelty 7.0

Low Gen-PPL in continuous diffusion LMs results from repetition caused by a 1D contractive attractor in self-conditioning feedback; ACE subtracts the direction to reduce repetition to human levels while preserving quality.
Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics
cs.CL 2026-06 accept novelty 7.0

Naive samplers beat published diffusion and flow models on gen-PPL with incoherent output, proving the metric unsound and motivating distributional evaluation suites.
Variational Learning for Insertion-based Generation
cs.LG 2026-06 unverdicted novelty 7.0

Introduces the Insertion Process model for variable-length non-monotonic sequence generation via a bijective permutation mapping and permutation-based variational inference.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 7.0

The paper introduces Manta-LM, which approximates the Hamilton-Jacobi-Bellman optimal policy via Flow Matching in a rectified latent control space to enable high-fidelity parallel language generation.
Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG 2026-05 unverdicted novelty 7.0

Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.
Infinite Mask Diffusion for Few-Step Distillation
cs.CL 2026-05 unverdicted novelty 7.0

Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
cs.CL 2026-05 unverdicted novelty 7.0

FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
cs.CL 2026-04 unverdicted novelty 7.0

LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL 2026-02 unverdicted novelty 7.0

Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG 2026-02 unverdicted novelty 7.0

Discrete Stochastic Localization lets a single trained network support an entire family of per-token SNR paths for discrete sequence generation, with masked diffusion as a special case, and improves MAUVE scores when ...
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
cs.AI 2025-10 unverdicted novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
Diffusion and Flow Matching Models for Tabular Data: A Survey
cs.LG 2025-02 unverdicted novelty 7.0

First dedicated survey organizing diffusion and flow matching models for tabular data synthesis, imputation, anomaly detection, and related tasks, covering literature from 2015 to 2026 and highlighting open problems.
Subliminal Clocks: Latent Time Modelling in Diffusion Language Models
cs.AI 2026-07 unverdicted novelty 6.0

DLMs encode a decodable latent timestep signal in residual activations that can be steered to predictably change model confidence and entropy.
Posterior Refinement: Fast Language Generation via Any-Order Flow Maps
cs.CL 2026-06 unverdicted novelty 6.0

FMLM+ with Posterior Refinement bridges masked diffusion and flow map models to match discrete baseline quality in language generation using 32x fewer neural function evaluations via posterior scoring and refinement.
Modular Diffusion Models for Structured Visual Recognition
cs.CV 2026-06 unverdicted novelty 6.0

Modular Diffusion Models decompose diffusion into task-specific modules to model distributions over structured visual outputs for detection, segmentation, and scene graph generation.
Demystifying Multimodal Biomolecular Co-design With Intrinsic Geodesic Coupling
q-bio.BM 2026-06 unverdicted novelty 6.0

GeoCoupling optimizes temporal couplings between modalities in biomolecular generative models and outperforms synchronous baselines on drug design and protein design tasks.
DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs
cs.CL 2026-05 unverdicted novelty 6.0

Adapting LLaDA-8B-Instruct via Discrete Stochastic Localization with continuous per-token Gaussian noise yields continuous denoising that achieves top ROUGE-1 on zero-shot summarization at low step budgets and adds se...
Fixed-Point Masked Generative Modeling
cs.LG 2026-05 unverdicted novelty 6.0

FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
cs.LG 2026-05 unverdicted novelty 6.0

DiLaDiff augments masked diffusion LMs with latent space modeling and consistency distillation to improve token correlation capture and inference speed.
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
cs.CL 2026-05 conditional novelty 6.0

RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 6.0

Manta-LM approximates the HJB equation via flow matching in latent control space to realize closed-loop optimal control for language generation.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 6.0

Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG 2026-05 unverdicted novelty 6.0

DSL provides a continuous embedding framework where one denoiser supports a family of SNR paths for discrete sequences, improving MAUVE scores on OpenWebText and allowing random-order and hybrid sampling from a fine-t...
ELF: Embedded Language Flows
cs.CL 2026-05 unverdicted novelty 6.0

ELF applies continuous-time flow matching in embedding space for language generation and reports outperforming prior discrete and continuous diffusion language models with fewer steps.
ELF: Embedded Language Flows
cs.CL 2026-05 unverdicted novelty 6.0

ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
TextLDM: Language Modeling with Continuous Latent Diffusion
cs.CL 2026-05 unverdicted novelty 6.0

TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
Continuous Latent Diffusion Language Model
cs.CL 2026-05 unverdicted novelty 6.0

Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
Consistent Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 6.0

CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
cs.AI 2026-04 unverdicted novelty 6.0

Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL 2026-02 conditional novelty 6.0

Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
Dream 7B: Diffusion Large Language Models
cs.CL 2025-08 unverdicted novelty 6.0

Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
cs.CL 2025-08 unverdicted novelty 6.0

Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG 2025-05 conditional novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
Logit-KL Flow Matching: Non-Autoregressive Text Generation via Sampling-Hybrid Inference
cs.CL 2024-11 unverdicted novelty 6.0

Logit-KL Flow Matching recovers the flow-matching velocity field from conditional likelihood maximization and uses iterative denoise-re-noise sampling to improve perplexity and downstream metrics over prior NAR baseli...
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
cs.CL 2024-10 conditional novelty 6.0

Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on langua...
Reinforcement Learning from Denoising Feedback
cs.CL 2026-05 unverdicted novelty 5.0

RLDF is a new RL paradigm for diffusion language models that optimizes toward clipped clean states with weighted timestep sampling and reports substantial gains on reasoning benchmarks for LLaDA and Dream.
Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

BA-Att introduces pre-downsampled block selection with norm-sorting and diagonal covariance correction to approximate sparse attention, yielding up to 6.95x speedup at 50% sparsity across language, multimodal, and vid...
When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation
cs.CL 2026-05 unverdicted novelty 5.0

Latent geometry metrics fail to ensure good token decoding in non-autoregressive text models; decoder recoverability and start distribution quality are the necessary evaluation criteria.
Consistent Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 5.0

CDLM introduces MPDC training for discrete diffusion models, recovering prior methods as limits and claiming new SOTA text generation performance especially at low sampling budgets.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 37 Pith papers · 17 internal anchors

[1]

Abadi, P

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation ( OSDI 16) , pages 265--283, 2016

work page 2016
[2]

Austin, D

J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34: 0 17981--17993, 2021

work page 2021
[3]

Babuschkin, K

I. Babuschkin, K. Baumli, A. Bell, S. Bhupatiraju, J. Bruce, P. Buchlovsky, D. Budden, T. Cai, A. Clark, I. Danihelka, C. Fantacci, J. Godwin, C. Jones, T. Hennigan, M. Hessel, S. Kapturowski, T. Keck, I. Kemaev, M. King, L. Martens, V. Mikulik, T. Norman, J. Quan, G. Papamakarios, R. Ring, F. Ruiz, A. Sanchez, R. Schneider, E. Sezener, S. Spencer, S. Sri...

work page 2020
[4]

Efficient Training of Language Models to Fill in the Middle

M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Audiolm: A language modeling approach to audio generation.arXiv preprint arXiv:2209.03143, 2022

Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour. AudioLM : a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143, 2022

work page arXiv 2022
[6]

Bradbury, R

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. Vander P las, S. Wanderman- M ilne, and Q. Zhang. JAX : composable transformations of P ython+ N um P y programs, 2018

work page 2018
[7]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901
[8]

Campbell, J

A. Campbell, J. Benton, V. De Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet. A continuous time framework for discrete denoising models. arXiv preprint arXiv:2205.14987, 2022

work page arXiv 2022
[9]

W. Chan, C. Saharia, G. Hinton, M. Norouzi, and N. Jaitly. Imputer: Sequence modelling via imputation and dynamic programming. In International Conference on Machine Learning, pages 1403--1413. PMLR, 2020

work page 2020
[10]

Chang, H

H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315--11325, 2022

work page 2022
[11]

T. Chen, R. Zhang, and G. Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022

work page Pith review arXiv 2022
[12]

J. C. Cox, J. E. Ingersoll Jr, and S. A. Ross. A theory of the term structure of interest rates. Econometrica, 2: 0 385--407, 1985

work page 1985
[13]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Jukebox: A Generative Model for Music

P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[15]

arXiv preprint arXiv:2103.06089 , year=

S. Dieleman, C. Nash, J. Engel, and K. Simonyan. Variable-rate discrete representation learning. arXiv preprint arXiv:2103.06089, 2021

work page arXiv 2021
[16]

Donahue, I

C. Donahue, I. Simon, and S. Dieleman. Piano genie. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 160--164, 2019

work page 2019
[17]

Durkan, A

C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. Advances in neural information processing systems, 32, 2019

work page 2019
[18]

Eikema and W

B. Eikema and W. Aziz. Sampling-based approximations to minimum bayes risk decoding for neural machine translation. arXiv preprint arXiv:2108.04718, 2021

work page arXiv 2021
[19]

Esser, R

P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873--12883, 2021

work page 2021
[20]

Ghazvininejad, O

M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112--6121, 2019

work page 2019
[21]

Ghazvininejad, O

M. Ghazvininejad, O. Levy, and L. Zettlemoyer. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785, 2020

work page arXiv 2001
[22]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014

work page 2014
[23]

Goyal, C

K. Goyal, C. Dyer, and T. Berg-Kirkpatrick. Exposing the implicit energy networks behind masked language models via metropolis--hastings. arXiv preprint arXiv:2106.02736, 2021

work page arXiv 2021
[24]

J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

J. Gu, C. Wang, and J. Zhao. Levenshtein transformer. arXiv preprint arXiv:1905.11006, 2019

work page arXiv 1905
[26]

X. Han, S. Kumar, and Y. Tsvetkov. SSD-LM : Semi-autoregressive simplex-based diffusion language model for text generation and modular control, 2022

work page 2022
[27]

C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del R ' o, M. Wiebe, P. Peterson, P. G ' e rard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array prog...

work page doi:10.1038/s41586-020-2649-2 2020
[28]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 0 6840--6851, 2020

work page 2020
[30]

J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022 a

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

J. Ho, T. Salimans, A. A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022 b

work page 2022
[32]

Holtzman, J

A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019

work page 2019
[33]

Hoogeboom, A

E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021 a

work page arXiv 2021
[34]

Hoogeboom, D

E. Hoogeboom, D. Nielsen, P. Jaini, P. Forr \'e , and M. Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34: 0 12454--12465, 2021 b

work page 2021
[35]

X. S. Huang, F. Perez, and M. Volkovs. Improving non-autoregressive translation models without distillation. In International Conference on Learning Representations, 2022

work page 2022
[36]

J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9 0 (3): 0 90--95, 2007

work page 2007
[37]

Hyv \"a rinen and P

A. Hyv \"a rinen and P. Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6 0 (4), 2005

work page 2005
[38]

Jaegle, F

A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651--4664. PMLR, 2021

work page 2021
[39]

Jayaram and J

V. Jayaram and J. Thickstun. Parallel and flexible sampling from autoregressive models via langevin dynamics. In International Conference on Machine Learning, pages 4807--4818. PMLR, 2021

work page 2021
[40]

Kaiser, S

L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pages 2390--2399. PMLR, 2018

work page 2018
[41]

Elucidating the Design Space of Diffusion-Based Generative Models

T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Kasai, J

J. Kasai, J. Cross, M. Ghazvininejad, and J. Gu. Non-autoregressive machine translation with disentangled context transformer. In International Conference on Machine Learning, pages 5144--5155. PMLR, 2020 a

work page 2020
[43]

Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation.arXiv preprint arXiv:2006.10369, 2020

J. Kasai, N. Pappas, H. Peng, J. Cross, and N. A. Smith. Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation. arXiv preprint arXiv:2006.10369, 2020 b

work page arXiv 2006
[44]

P. Kidger. O n N eural D ifferential E quations . PhD thesis, University of Oxford, 2021

work page 2021
[45]

Sequence-Level Knowledge Distillation

Y. Kim and A. M. Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[46]

Kingma, T

D. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models. Advances in neural information processing systems, 34: 0 21696--21707, 2021

work page 2021
[47]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015

work page 2015
[48]

D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018

work page 2018
[49]

D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[50]

X. Kong, Z. Zhang, and E. Hovy. Incorporating a local translation mechanism into non-autoregressive translation. arXiv preprint arXiv:2011.06132, 2020

work page arXiv 2011
[51]

T. Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1007

work page doi:10.18653/v1/p18-1007 2018
[52]

S entence P iece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[53]

Kumar and W

S. Kumar and W. Byrne. Minimum B ayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North A merican Chapter of the Association for Computational Linguistics: HLT - NAACL 2004 , pages 169--176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics

work page 2004
[54]

J. Lee, E. Mansimov, and K. Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[55]

X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto. Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022

work page arXiv 2022
[56]

C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022 a

work page Pith review arXiv 2022
[57]

C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022 b

work page internal anchor Pith review Pith/arXiv arXiv 2022
[58]

C. Meng, K. Choi, J. Song, and S. Ermon. Concrete score matching: Generalized score matching for discrete data. arXiv preprint arXiv:2211.00802, 2022

work page arXiv 2022
[59]

M \"u ller, B

T. M \"u ller, B. McWilliams, F. Rousselle, M. Gross, and J. Nov \'a k. Neural importance sampling. ACM Transactions on Graphics (TOG), 38 0 (5): 0 1--19, 2019

work page 2019
[60]

A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162--8171. PMLR, 2021

work page 2021
[61]

Papineni, S

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311--318, 2002

work page 2002
[62]

Perez, F

E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018
[63]

Pillutla, S

K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. In NeurIPS, 2021

work page 2021
[64]

M. Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186--191, Belgium, Brussels, Oct. 2018. Association for Computational Linguistics

work page 2018
[65]

Press and L

O. Press and L. Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 157--163, Valencia, Spain, Apr. 2017. Association for Computational Linguistics

work page 2017
[66]

J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[67]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019

work page 2019
[68]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[69]

M. Reid, V. J. Hellendoorn, and G. Neubig. Diffuser: Discrete diffusion via edit-based reconstruction, 2022

work page 2022
[70]

D. J. Rezende and F. Viola. Generalized elbo with constrained optimization, geco. In Workshop on Bayesian Deep Learning, NeurIPS, 2018

work page 2018
[71]

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278--1286. PMLR, 2014

work page 2014
[72]

P. H. Richemond, S. Dieleman, and A. Doucet. Categorical sdes with simplex diffusion. arXiv preprint arXiv:2210.14784, 2022

work page arXiv 2022
[73]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684--10695, 2022

work page 2022
[74]

Saharia, W

C. Saharia, W. Chan, S. Saxena, and M. Norouzi. Non-autoregressive machine translation with latent alignments. arXiv preprint arXiv:2004.07437, 2020

work page arXiv 2004
[75]

Saharia, W

C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1--10, 2022 a

work page 2022
[76]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022 b

work page internal anchor Pith review Pith/arXiv arXiv 2022
[77]

, author Chung, J

N. Savinov, J. Chung, M. Binkowski, E. Elsen, and A. v. d. Oord. Step-unrolled denoising autoencoders for text generation. arXiv preprint arXiv:2112.06749, 2021

work page arXiv 2021
[78]

A. Shih, D. Sadigh, and S. Ermon. Training and inference on any-order autoregressive models the right way. arXiv preprint arXiv:2205.13554, 2022

work page arXiv 2022
[79]

Sohl-Dickstein, E

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256--2265. PMLR, 2015

work page 2015
[80]

Song and S

Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

Showing first 80 references.

[1] [1]

Abadi, P

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation ( OSDI 16) , pages 265--283, 2016

work page 2016

[2] [2]

Austin, D

J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34: 0 17981--17993, 2021

work page 2021

[3] [3]

Babuschkin, K

I. Babuschkin, K. Baumli, A. Bell, S. Bhupatiraju, J. Bruce, P. Buchlovsky, D. Budden, T. Cai, A. Clark, I. Danihelka, C. Fantacci, J. Godwin, C. Jones, T. Hennigan, M. Hessel, S. Kapturowski, T. Keck, I. Kemaev, M. King, L. Martens, V. Mikulik, T. Norman, J. Quan, G. Papamakarios, R. Ring, F. Ruiz, A. Sanchez, R. Schneider, E. Sezener, S. Spencer, S. Sri...

work page 2020

[4] [4]

Efficient Training of Language Models to Fill in the Middle

M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

Audiolm: A language modeling approach to audio generation.arXiv preprint arXiv:2209.03143, 2022

Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour. AudioLM : a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143, 2022

work page arXiv 2022

[6] [6]

Bradbury, R

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. Vander P las, S. Wanderman- M ilne, and Q. Zhang. JAX : composable transformations of P ython+ N um P y programs, 2018

work page 2018

[7] [7]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901

[8] [8]

Campbell, J

A. Campbell, J. Benton, V. De Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet. A continuous time framework for discrete denoising models. arXiv preprint arXiv:2205.14987, 2022

work page arXiv 2022

[9] [9]

W. Chan, C. Saharia, G. Hinton, M. Norouzi, and N. Jaitly. Imputer: Sequence modelling via imputation and dynamic programming. In International Conference on Machine Learning, pages 1403--1413. PMLR, 2020

work page 2020

[10] [10]

Chang, H

H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315--11325, 2022

work page 2022

[11] [11]

T. Chen, R. Zhang, and G. Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022

work page Pith review arXiv 2022

[12] [12]

J. C. Cox, J. E. Ingersoll Jr, and S. A. Ross. A theory of the term structure of interest rates. Econometrica, 2: 0 385--407, 1985

work page 1985

[13] [13]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Jukebox: A Generative Model for Music

P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005

[15] [15]

arXiv preprint arXiv:2103.06089 , year=

S. Dieleman, C. Nash, J. Engel, and K. Simonyan. Variable-rate discrete representation learning. arXiv preprint arXiv:2103.06089, 2021

work page arXiv 2021

[16] [16]

Donahue, I

C. Donahue, I. Simon, and S. Dieleman. Piano genie. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 160--164, 2019

work page 2019

[17] [17]

Durkan, A

C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. Advances in neural information processing systems, 32, 2019

work page 2019

[18] [18]

Eikema and W

B. Eikema and W. Aziz. Sampling-based approximations to minimum bayes risk decoding for neural machine translation. arXiv preprint arXiv:2108.04718, 2021

work page arXiv 2021

[19] [19]

Esser, R

P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873--12883, 2021

work page 2021

[20] [20]

Ghazvininejad, O

M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112--6121, 2019

work page 2019

[21] [21]

Ghazvininejad, O

M. Ghazvininejad, O. Levy, and L. Zettlemoyer. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785, 2020

work page arXiv 2001

[22] [22]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014

work page 2014

[23] [23]

Goyal, C

K. Goyal, C. Dyer, and T. Berg-Kirkpatrick. Exposing the implicit energy networks behind masked language models via metropolis--hastings. arXiv preprint arXiv:2106.02736, 2021

work page arXiv 2021

[24] [24]

J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [25]

J. Gu, C. Wang, and J. Zhao. Levenshtein transformer. arXiv preprint arXiv:1905.11006, 2019

work page arXiv 1905

[26] [26]

X. Han, S. Kumar, and Y. Tsvetkov. SSD-LM : Semi-autoregressive simplex-based diffusion language model for text generation and modular control, 2022

work page 2022

[27] [27]

C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del R ' o, M. Wiebe, P. Peterson, P. G ' e rard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array prog...

work page doi:10.1038/s41586-020-2649-2 2020

[28] [28]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 0 6840--6851, 2020

work page 2020

[30] [30]

J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022 a

work page internal anchor Pith review Pith/arXiv arXiv 2022

[31] [31]

J. Ho, T. Salimans, A. A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022 b

work page 2022

[32] [32]

Holtzman, J

A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019

work page 2019

[33] [33]

Hoogeboom, A

E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021 a

work page arXiv 2021

[34] [34]

Hoogeboom, D

E. Hoogeboom, D. Nielsen, P. Jaini, P. Forr \'e , and M. Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34: 0 12454--12465, 2021 b

work page 2021

[35] [35]

X. S. Huang, F. Perez, and M. Volkovs. Improving non-autoregressive translation models without distillation. In International Conference on Learning Representations, 2022

work page 2022

[36] [36]

J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9 0 (3): 0 90--95, 2007

work page 2007

[37] [37]

Hyv \"a rinen and P

A. Hyv \"a rinen and P. Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6 0 (4), 2005

work page 2005

[38] [38]

Jaegle, F

A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651--4664. PMLR, 2021

work page 2021

[39] [39]

Jayaram and J

V. Jayaram and J. Thickstun. Parallel and flexible sampling from autoregressive models via langevin dynamics. In International Conference on Machine Learning, pages 4807--4818. PMLR, 2021

work page 2021

[40] [40]

Kaiser, S

L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pages 2390--2399. PMLR, 2018

work page 2018

[41] [41]

Elucidating the Design Space of Diffusion-Based Generative Models

T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[42] [42]

Kasai, J

J. Kasai, J. Cross, M. Ghazvininejad, and J. Gu. Non-autoregressive machine translation with disentangled context transformer. In International Conference on Machine Learning, pages 5144--5155. PMLR, 2020 a

work page 2020

[43] [43]

Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation.arXiv preprint arXiv:2006.10369, 2020

J. Kasai, N. Pappas, H. Peng, J. Cross, and N. A. Smith. Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation. arXiv preprint arXiv:2006.10369, 2020 b

work page arXiv 2006

[44] [44]

P. Kidger. O n N eural D ifferential E quations . PhD thesis, University of Oxford, 2021

work page 2021

[45] [45]

Sequence-Level Knowledge Distillation

Y. Kim and A. M. Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[46] [46]

Kingma, T

D. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models. Advances in neural information processing systems, 34: 0 21696--21707, 2021

work page 2021

[47] [47]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015

work page 2015

[48] [48]

D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018

work page 2018

[49] [49]

D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[50] [50]

X. Kong, Z. Zhang, and E. Hovy. Incorporating a local translation mechanism into non-autoregressive translation. arXiv preprint arXiv:2011.06132, 2020

work page arXiv 2011

[51] [51]

T. Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1007

work page doi:10.18653/v1/p18-1007 2018

[52] [52]

S entence P iece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[53] [53]

Kumar and W

S. Kumar and W. Byrne. Minimum B ayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North A merican Chapter of the Association for Computational Linguistics: HLT - NAACL 2004 , pages 169--176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics

work page 2004

[54] [54]

J. Lee, E. Mansimov, and K. Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[55] [55]

X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto. Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022

work page arXiv 2022

[56] [56]

C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022 a

work page Pith review arXiv 2022

[57] [57]

C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022 b

work page internal anchor Pith review Pith/arXiv arXiv 2022

[58] [58]

C. Meng, K. Choi, J. Song, and S. Ermon. Concrete score matching: Generalized score matching for discrete data. arXiv preprint arXiv:2211.00802, 2022

work page arXiv 2022

[59] [59]

M \"u ller, B

T. M \"u ller, B. McWilliams, F. Rousselle, M. Gross, and J. Nov \'a k. Neural importance sampling. ACM Transactions on Graphics (TOG), 38 0 (5): 0 1--19, 2019

work page 2019

[60] [60]

A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162--8171. PMLR, 2021

work page 2021

[61] [61]

Papineni, S

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311--318, 2002

work page 2002

[62] [62]

Perez, F

E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018

[63] [63]

Pillutla, S

K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. In NeurIPS, 2021

work page 2021

[64] [64]

M. Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186--191, Belgium, Brussels, Oct. 2018. Association for Computational Linguistics

work page 2018

[65] [65]

Press and L

O. Press and L. Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 157--163, Valencia, Spain, Apr. 2017. Association for Computational Linguistics

work page 2017

[66] [66]

J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[67] [67]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019

work page 2019

[68] [68]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[69] [69]

M. Reid, V. J. Hellendoorn, and G. Neubig. Diffuser: Discrete diffusion via edit-based reconstruction, 2022

work page 2022

[70] [70]

D. J. Rezende and F. Viola. Generalized elbo with constrained optimization, geco. In Workshop on Bayesian Deep Learning, NeurIPS, 2018

work page 2018

[71] [71]

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278--1286. PMLR, 2014

work page 2014

[72] [72]

P. H. Richemond, S. Dieleman, and A. Doucet. Categorical sdes with simplex diffusion. arXiv preprint arXiv:2210.14784, 2022

work page arXiv 2022

[73] [73]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684--10695, 2022

work page 2022

[74] [74]

Saharia, W

C. Saharia, W. Chan, S. Saxena, and M. Norouzi. Non-autoregressive machine translation with latent alignments. arXiv preprint arXiv:2004.07437, 2020

work page arXiv 2004

[75] [75]

Saharia, W

C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1--10, 2022 a

work page 2022

[76] [76]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022 b

work page internal anchor Pith review Pith/arXiv arXiv 2022

[77] [77]

, author Chung, J

N. Savinov, J. Chung, M. Binkowski, E. Elsen, and A. v. d. Oord. Step-unrolled denoising autoencoders for text generation. arXiv preprint arXiv:2112.06749, 2021

work page arXiv 2021

[78] [78]

A. Shih, D. Sadigh, and S. Ermon. Training and inference on any-order autoregressive models the right way. arXiv preprint arXiv:2205.13554, 2022

work page arXiv 2022

[79] [79]

Sohl-Dickstein, E

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256--2265. PMLR, 2015

work page 2015

[80] [80]

Song and S

Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019

work page 2019