arxiv: 2211.15089 · v3 · pith:SW3KZRUKnew · submitted 2022-11-28 · 💻 cs.CL · cs.LG

Continuous diffusion for categorical data

Sander Dieleman , Laurent Sartran , Arman Roshannai , Nikolay Savinov , Yaroslav Ganin , Pierre H. Richemond , Arnaud Doucet , Robin Strudel

show 6 more authors

Chris Dyer Conor Durkan Curtis Hawthorne R\'emi Leblond Will Grathwohl Jonas Adler

This is my paper

Pith reviewed 2026-05-18 03:25 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords diffusion modelscategorical datalanguage modelingcontinuous diffusiongenerative modelsembeddingsCDCD

0 comments

The pith

Categorical data such as language can be generated with diffusion models by first embedding tokens into a continuous vector space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to keep the core strengths of diffusion models—iterative refinement through a continuous process—when moving from perceptual signals like images to discrete data like text. Instead of inventing new discrete diffusion steps, it embeds each categorical token into a vector and then runs ordinary Gaussian diffusion over continuous time in that space. A sympathetic reader would care because this keeps the same smooth denoising trajectory and training dynamics that made diffusion models dominant for continuous domains, while still producing valid categorical outputs at the end. If the approach holds, generative modeling no longer needs to split into separate continuous and discrete families.

Core claim

We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. Discrete tokens are first mapped to continuous vectors; standard Gaussian diffusion is then performed in this embedding space, allowing the same continuous-time refinement procedure used for images and audio to operate on language and other categorical sequences. The resulting models are shown to perform effectively on several language modelling tasks.

What carries the argument

The CDCD framework, which maps categorical tokens to continuous vectors and applies Gaussian diffusion continuously in both time and that vector space.

If this is right

Language modeling tasks can be solved by continuous-time iterative refinement rather than discrete token transitions.
Standard Gaussian diffusion machinery transfers directly once tokens are embedded, avoiding the need for custom discrete noise schedules.
The same framework applies across multiple language modeling benchmarks without changing the underlying diffusion process.
Generation remains a gradual denoising trajectory that can be stopped at any continuous time point.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may allow a single diffusion backbone to handle mixed continuous and categorical inputs in multimodal settings.
Continuous-time sampling could be tuned more finely than discrete-step schedules, potentially improving speed-quality trade-offs in text generation.
Similar embedding-plus-diffusion pipelines might extend naturally to other categorical sequences such as source code or biological strings.

Load-bearing premise

Embedding discrete categorical tokens into a continuous vector space and running standard Gaussian diffusion there preserves the benefits of continuous diffusion without introducing new modeling errors that dominate performance.

What would settle it

Train a CDCD model and a strong discrete-diffusion baseline on the same language dataset and measure generation quality or perplexity; if the continuous-embedding version shows no advantage or clear degradation traceable to the embedding step, the central claim is falsified.

read the original abstract

Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CDCD runs standard Gaussian diffusion on learned token embeddings then decodes back to categories, which produces usable language samples but leaves the discretization error unanalyzed.

read the letter

The core move is to map categorical tokens to continuous vectors, run ordinary diffusion there, and map back at the end. That keeps the sampling continuous in both time and space, which is the main technical claim. The experiments show the approach can train and sample on language tasks without collapsing, and the results are at least competitive with some discrete diffusion baselines on the reported metrics. Credit for actually shipping working code and numbers instead of just the abstract idea. The adaptation of the loss and the embedding training look like standard diffusion machinery with a straightforward categorical wrapper, which is fine as far as it goes. The paper is clear about what it is trying to do and does not overclaim the gains. The soft spot is exactly the one the stress-test flagged: there is no bound or measurement of how much error the final discretization step introduces, nor any controlled test showing that the continuous diffusion itself is doing more work than the choice of embedding space. If the embeddings already separate the categories well, the diffusion may be adding little beyond what a simpler autoregressive or discrete model could do. The results could be driven more by embedding quality than by the diffusion process, and without that separation the continuous benefit is hard to credit. This is the kind of paper that belongs in a reading group for people who follow diffusion extensions to discrete domains. It is worth a serious referee because the implementation is reproducible and the central engineering question is well-posed, even if the current evidence does not yet prove the continuous advantage is load-bearing. I would send it out for review with a request for tighter analysis of the embedding-to-token mapping error.

Referee Report

1 major / 1 minor

Summary. The paper proposes CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space by embedding discrete tokens into a continuous vector space and applying Gaussian diffusion, and demonstrates its efficacy on several language modelling tasks.

Significance. If the central assumption holds—that Gaussian diffusion on learned token embeddings can recover categorical structure upon decoding without dominant approximation error—this work could significantly advance generative modeling for discrete data by preserving the benefits of continuous diffusion models, such as iterative refinement, for applications like language modeling.

major comments (1)

[§3.1] The description of the embedding and diffusion process lacks a formal analysis or bound on the discretization error in the final decoding step (e.g., nearest neighbor or softmax), which is load-bearing for the claim that the model remains effectively continuous in input space without the embedding choice dominating performance.

minor comments (1)

The abstract could more explicitly state the specific language modeling tasks used for demonstration.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the major comment below and indicate the changes we will make to the manuscript.

read point-by-point responses

Referee: [§3.1] The description of the embedding and diffusion process lacks a formal analysis or bound on the discretization error in the final decoding step (e.g., nearest neighbor or softmax), which is load-bearing for the claim that the model remains effectively continuous in input space without the embedding choice dominating performance.

Authors: We agree that a formal bound on discretization error would strengthen the presentation. The manuscript currently relies on the fact that the diffusion process itself operates entirely in continuous space and time, with the embedding learned jointly so that the decoder (nearest-neighbor or softmax) recovers coherent categorical sequences, as demonstrated by the language-modeling results. Deriving a tight, non-vacuous bound is non-trivial because the embedding is data-dependent and optimized end-to-end; any such bound would necessarily involve assumptions on the Lipschitz constant of the embedding map and the concentration of the learned representations. In the revised version we will add a short subsection in §3.1 that (i) explicitly states the decoding step as a post-processing operation that does not alter the continuous nature of the forward and reverse processes, (ii) reports empirical reconstruction error (continuous vector to nearest token) on held-out embeddings as a function of embedding dimension, and (iii) discusses why the observed error does not dominate the generative performance relative to purely discrete baselines. We believe this addresses the concern without requiring an intractable theoretical bound. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain self-contained with no reductions to fitted inputs or self-citations

full rationale

The provided abstract and context describe a proposed framework CDCD that embeds categorical tokens into continuous space for Gaussian diffusion, with no equations, fitting procedures, or self-citations presented that would allow any claimed result to reduce to its inputs by construction. The central claim of preserving continuous diffusion benefits for discrete data is stated as a modeling choice and empirical demonstration rather than a derived prediction forced by prior definitions or author-specific uniqueness theorems. No load-bearing steps match the enumerated circularity patterns, as the text contains no explicit parameter fitting renamed as prediction, ansatz smuggling, or renaming of known results. This is the expected honest non-finding for a high-level proposal without detailed derivation visible in the given material.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central claim rests on the unstated assumption that a continuous embedding of discrete tokens can be denoised with standard diffusion dynamics and then mapped back without loss of the original categorical structure.

pith-pipeline@v0.9.0 · 5663 in / 1036 out tokens · 18730 ms · 2026-05-18T03:25:47.881749+00:00 · methodology

discussion (0)

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG 2026-05 unverdicted novelty 7.0

Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.
Infinite Mask Diffusion for Few-Step Distillation
cs.CL 2026-05 unverdicted novelty 7.0

Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
cs.CL 2026-05 unverdicted novelty 7.0

FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
cs.CL 2026-04 unverdicted novelty 7.0

LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL 2026-02 unverdicted novelty 7.0

Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
cs.AI 2025-10 unverdicted novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 6.0

Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
ELF: Embedded Language Flows
cs.CL 2026-05 unverdicted novelty 6.0

ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
TextLDM: Language Modeling with Continuous Latent Diffusion
cs.CL 2026-05 unverdicted novelty 6.0

TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
Continuous Latent Diffusion Language Model
cs.CL 2026-05 unverdicted novelty 6.0

Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
Consistent Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 6.0

CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
cs.AI 2026-04 unverdicted novelty 6.0

Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
Dream 7B: Diffusion Large Language Models
cs.CL 2025-08 unverdicted novelty 6.0

Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
cs.CL 2025-08 unverdicted novelty 6.0

Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG 2025-05 conditional novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 17 Pith papers · 17 internal anchors

[1]

Abadi, P

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation ( OSDI 16) , pages 265--283, 2016

work page 2016
[2]

Austin, D

J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34: 0 17981--17993, 2021

work page 2021
[3]

Babuschkin, K

I. Babuschkin, K. Baumli, A. Bell, S. Bhupatiraju, J. Bruce, P. Buchlovsky, D. Budden, T. Cai, A. Clark, I. Danihelka, C. Fantacci, J. Godwin, C. Jones, T. Hennigan, M. Hessel, S. Kapturowski, T. Keck, I. Kemaev, M. King, L. Martens, V. Mikulik, T. Norman, J. Quan, G. Papamakarios, R. Ring, F. Ruiz, A. Sanchez, R. Schneider, E. Sezener, S. Spencer, S. Sri...

work page 2020
[4]

Efficient Training of Language Models to Fill in the Middle

M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Borsos, R

Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour. AudioLM : a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143, 2022

work page arXiv 2022
[6]

Bradbury, R

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. Vander P las, S. Wanderman- M ilne, and Q. Zhang. JAX : composable transformations of P ython+ N um P y programs, 2018

work page 2018
[7]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901
[8]

Campbell, J

A. Campbell, J. Benton, V. De Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet. A continuous time framework for discrete denoising models. arXiv preprint arXiv:2205.14987, 2022

work page arXiv 2022
[9]

W. Chan, C. Saharia, G. Hinton, M. Norouzi, and N. Jaitly. Imputer: Sequence modelling via imputation and dynamic programming. In International Conference on Machine Learning, pages 1403--1413. PMLR, 2020

work page 2020
[10]

Chang, H

H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315--11325, 2022

work page 2022
[11]

T. Chen, R. Zhang, and G. Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022

work page arXiv 2022
[12]

J. C. Cox, J. E. Ingersoll Jr, and S. A. Ross. A theory of the term structure of interest rates. Econometrica, 2: 0 385--407, 1985

work page 1985
[13]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Jukebox: A Generative Model for Music

P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[15]

Dieleman, C

S. Dieleman, C. Nash, J. Engel, and K. Simonyan. Variable-rate discrete representation learning. arXiv preprint arXiv:2103.06089, 2021

work page arXiv 2021
[16]

Donahue, I

C. Donahue, I. Simon, and S. Dieleman. Piano genie. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 160--164, 2019

work page 2019
[17]

Durkan, A

C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. Advances in neural information processing systems, 32, 2019

work page 2019
[18]

Eikema and W

B. Eikema and W. Aziz. Sampling-based approximations to minimum bayes risk decoding for neural machine translation. arXiv preprint arXiv:2108.04718, 2021

work page arXiv 2021
[19]

Esser, R

P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873--12883, 2021

work page 2021
[20]

Ghazvininejad, O

M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112--6121, 2019

work page 2019
[21]

Ghazvininejad, O

M. Ghazvininejad, O. Levy, and L. Zettlemoyer. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785, 2020

work page arXiv 2001
[22]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014

work page 2014
[23]

Goyal, C

K. Goyal, C. Dyer, and T. Berg-Kirkpatrick. Exposing the implicit energy networks behind masked language models via metropolis--hastings. arXiv preprint arXiv:2106.02736, 2021

work page arXiv 2021
[24]

J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

J. Gu, C. Wang, and J. Zhao. Levenshtein transformer. arXiv preprint arXiv:1905.11006, 2019

work page arXiv 1905
[26]

X. Han, S. Kumar, and Y. Tsvetkov. SSD-LM : Semi-autoregressive simplex-based diffusion language model for text generation and modular control, 2022

work page 2022
[27]

C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del R ' o, M. Wiebe, P. Peterson, P. G ' e rard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array prog...

work page doi:10.1038/s41586-020-2649-2 2020
[28]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 0 6840--6851, 2020

work page 2020
[30]

J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022 a

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

J. Ho, T. Salimans, A. A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022 b

work page 2022
[32]

Holtzman, J

A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019

work page 2019
[33]

Hoogeboom, A

E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021 a

work page arXiv 2021
[34]

Hoogeboom, D

E. Hoogeboom, D. Nielsen, P. Jaini, P. Forr \'e , and M. Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34: 0 12454--12465, 2021 b

work page 2021
[35]

X. S. Huang, F. Perez, and M. Volkovs. Improving non-autoregressive translation models without distillation. In International Conference on Learning Representations, 2022

work page 2022
[36]

J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9 0 (3): 0 90--95, 2007

work page 2007
[37]

Hyv \"a rinen and P

A. Hyv \"a rinen and P. Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6 0 (4), 2005

work page 2005
[38]

Jaegle, F

A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651--4664. PMLR, 2021

work page 2021
[39]

Jayaram and J

V. Jayaram and J. Thickstun. Parallel and flexible sampling from autoregressive models via langevin dynamics. In International Conference on Machine Learning, pages 4807--4818. PMLR, 2021

work page 2021
[40]

Kaiser, S

L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pages 2390--2399. PMLR, 2018

work page 2018
[41]

Elucidating the Design Space of Diffusion-Based Generative Models

T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Kasai, J

J. Kasai, J. Cross, M. Ghazvininejad, and J. Gu. Non-autoregressive machine translation with disentangled context transformer. In International Conference on Machine Learning, pages 5144--5155. PMLR, 2020 a

work page 2020
[43]

Kasai, N

J. Kasai, N. Pappas, H. Peng, J. Cross, and N. A. Smith. Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation. arXiv preprint arXiv:2006.10369, 2020 b

work page arXiv 2006
[44]

P. Kidger. O n N eural D ifferential E quations . PhD thesis, University of Oxford, 2021

work page 2021
[45]

Sequence-Level Knowledge Distillation

Y. Kim and A. M. Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[46]

Kingma, T

D. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models. Advances in neural information processing systems, 34: 0 21696--21707, 2021

work page 2021
[47]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015

work page 2015
[48]

D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018

work page 2018
[49]

D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[50]

X. Kong, Z. Zhang, and E. Hovy. Incorporating a local translation mechanism into non-autoregressive translation. arXiv preprint arXiv:2011.06132, 2020

work page arXiv 2011
[51]

T. Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1007

work page doi:10.18653/v1/p18-1007 2018
[52]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[53]

Kumar and W

S. Kumar and W. Byrne. Minimum B ayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North A merican Chapter of the Association for Computational Linguistics: HLT - NAACL 2004 , pages 169--176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics

work page 2004
[54]

J. Lee, E. Mansimov, and K. Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[55]

X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto. Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022

work page arXiv 2022
[56]

C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022 a

work page arXiv 2022
[57]

C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022 b

work page internal anchor Pith review Pith/arXiv arXiv 2022
[58]

C. Meng, K. Choi, J. Song, and S. Ermon. Concrete score matching: Generalized score matching for discrete data. arXiv preprint arXiv:2211.00802, 2022

work page arXiv 2022
[59]

M \"u ller, B

T. M \"u ller, B. McWilliams, F. Rousselle, M. Gross, and J. Nov \'a k. Neural importance sampling. ACM Transactions on Graphics (TOG), 38 0 (5): 0 1--19, 2019

work page 2019
[60]

A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162--8171. PMLR, 2021

work page 2021
[61]

Papineni, S

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311--318, 2002

work page 2002
[62]

Perez, F

E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018
[63]

Pillutla, S

K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. In NeurIPS, 2021

work page 2021
[64]

M. Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186--191, Belgium, Brussels, Oct. 2018. Association for Computational Linguistics

work page 2018
[65]

Press and L

O. Press and L. Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 157--163, Valencia, Spain, Apr. 2017. Association for Computational Linguistics

work page 2017
[66]

J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[67]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019

work page 2019
[68]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[69]

M. Reid, V. J. Hellendoorn, and G. Neubig. Diffuser: Discrete diffusion via edit-based reconstruction, 2022

work page 2022
[70]

D. J. Rezende and F. Viola. Generalized elbo with constrained optimization, geco. In Workshop on Bayesian Deep Learning, NeurIPS, 2018

work page 2018
[71]

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278--1286. PMLR, 2014

work page 2014
[72]

P. H. Richemond, S. Dieleman, and A. Doucet. Categorical sdes with simplex diffusion. arXiv preprint arXiv:2210.14784, 2022

work page arXiv 2022
[73]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684--10695, 2022

work page 2022
[74]

Saharia, W

C. Saharia, W. Chan, S. Saxena, and M. Norouzi. Non-autoregressive machine translation with latent alignments. arXiv preprint arXiv:2004.07437, 2020

work page arXiv 2004
[75]

Saharia, W

C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1--10, 2022 a

work page 2022
[76]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022 b

work page internal anchor Pith review Pith/arXiv arXiv 2022
[77]

Savinov, J

N. Savinov, J. Chung, M. Binkowski, E. Elsen, and A. v. d. Oord. Step-unrolled denoising autoencoders for text generation. arXiv preprint arXiv:2112.06749, 2021

work page arXiv 2021
[78]

A. Shih, D. Sadigh, and S. Ermon. Training and inference on any-order autoregressive models the right way. arXiv preprint arXiv:2205.13554, 2022

work page arXiv 2022
[79]

Sohl-Dickstein, E

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256--2265. PMLR, 2015

work page 2015
[80]

Song and S

Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

Showing first 80 references.