Continuous diffusion for categorical data
Pith reviewed 2026-05-18 03:25 UTC · model grok-4.3
The pith
Categorical data such as language can be generated with diffusion models by first embedding tokens into a continuous vector space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. Discrete tokens are first mapped to continuous vectors; standard Gaussian diffusion is then performed in this embedding space, allowing the same continuous-time refinement procedure used for images and audio to operate on language and other categorical sequences. The resulting models are shown to perform effectively on several language modelling tasks.
What carries the argument
The CDCD framework, which maps categorical tokens to continuous vectors and applies Gaussian diffusion continuously in both time and that vector space.
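As a minimal sketch of the mechanism described above, the following Python snippet embeds a few token ids, perturbs the embeddings with Gaussian noise, and decodes back by nearest neighbour. The random unit-norm table, the additive noising rule x = x0 + sigma * eps, and all function names are illustrative assumptions, not the paper's learned embeddings or its exact noise schedule.

```python
import math
import random


def make_embeddings(vocab_size, dim, rng):
    """Random unit-norm embedding table (a stand-in for learned embeddings)."""
    table = []
    for _ in range(vocab_size):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        table.append([c / norm for c in v])
    return table


def add_noise(x0, sigma, rng):
    """Assumed forward process at noise level sigma: x = x0 + sigma * eps."""
    return [c + sigma * rng.gauss(0.0, 1.0) for c in x0]


def nearest_token(x, table):
    """Decode a continuous vector to the id of the closest embedding."""
    def sq_dist(e):
        return sum((a - b) ** 2 for a, b in zip(x, e))
    return min(range(len(table)), key=lambda k: sq_dist(table[k]))


rng = random.Random(0)
table = make_embeddings(vocab_size=8, dim=16, rng=rng)

tokens = [3, 1, 4, 1, 5]                      # discrete input sequence
clean = [table[t] for t in tokens]            # continuous representation
noisy = [add_noise(x, sigma=0.05, rng=rng) for x in clean]

decoded_clean = [nearest_token(x, table) for x in clean]
decoded_noisy = [nearest_token(x, table) for x in noisy]
```

At a small noise level the perturbed vectors stay much closer to their own embedding than to any other row, so nearest-neighbour decoding recovers the input; larger values of `sigma` make decoding errors increasingly likely.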
If this is right
- Language modelling tasks can be solved by continuous-time iterative refinement rather than discrete token transitions.
- Standard Gaussian diffusion machinery transfers directly once tokens are embedded, avoiding the need for custom discrete noise schedules.
- The same framework applies across multiple language modelling benchmarks without changing the underlying diffusion process.
- Generation remains a gradual denoising trajectory that can be stopped at any continuous time point.
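One way to picture the last point is an ODE-style sampler whose integration can be halted at any time value. The sketch below is a toy under stated assumptions: it uses a variance-exploding parameterisation with sigma(t) = t, an exact posterior-mean denoiser over a small random embedding table in place of a trained network, and plain Euler steps. None of these choices is confirmed by the text above, and the names `unit_embeddings`, `posterior_mean`, `geometric_times`, and `sample` are hypothetical.

```python
import math
import random


def unit_embeddings(vocab_size, dim, rng):
    """Random unit-norm embedding table (a stand-in for learned embeddings)."""
    table = []
    for _ in range(vocab_size):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        table.append([c / norm for c in v])
    return table


def posterior_mean(x, sigma, table):
    """E[x0 | x] for x = x0 + sigma * eps with a uniform prior over table rows."""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x, e)) / (2.0 * sigma ** 2)
        for e in table
    ]
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [
        sum(w * e[i] for w, e in zip(weights, table)) / total
        for i in range(len(x))
    ]


def geometric_times(t_max, t_min, steps):
    """Decreasing time grid from t_max to t_min, followed by a final 0."""
    ratio = (t_min / t_max) ** (1.0 / (steps - 1))
    return [t_max * ratio ** i for i in range(steps)] + [0.0]


def sample(table, times, rng, stop_at=0.0):
    """Euler steps on dx/dt = (x - D(x, t)) / t, halting before t < stop_at."""
    t = times[0]
    x = [t * rng.gauss(0.0, 1.0) for _ in range(len(table[0]))]
    for t_next in times[1:]:
        if t_next < stop_at:
            break  # stop the trajectory at an intermediate time
        d = posterior_mean(x, t, table)
        x = [xi + (t_next - t) * (xi - di) / t for xi, di in zip(x, d)]
        t = t_next
    return x, t


rng = random.Random(0)
table = unit_embeddings(vocab_size=8, dim=16, rng=rng)
times = geometric_times(t_max=10.0, t_min=0.01, steps=50)

x_final, t_final = sample(table, times, rng)               # full trajectory
x_early, t_early = sample(table, times, rng, stop_at=1.0)  # halted early
```

Running the trajectory to `t = 0` should land on (or extremely close to) one row of the table, while the early-stopped sample is still a noisy intermediate state that could be inspected or refined further.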
Where Pith is reading between the lines
- The method may allow a single diffusion backbone to handle mixed continuous and categorical inputs in multimodal settings.
- Continuous-time sampling could be tuned more finely than discrete-step schedules, potentially improving speed-quality trade-offs in text generation.
- Similar embedding-plus-diffusion pipelines might extend naturally to other categorical sequences such as source code or biological strings.
Load-bearing premise
Embedding discrete categorical tokens into a continuous vector space and running standard Gaussian diffusion there preserves the benefits of continuous diffusion without introducing new modeling errors that dominate performance.
What would settle it
Train a CDCD model and a strong discrete-diffusion baseline on the same language dataset and measure generation quality or perplexity; if the continuous-embedding version shows no advantage or clear degradation traceable to the embedding step, the central claim is falsified.
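For the perplexity side of such a comparison, a minimal sketch of the metric is given below. It assumes each model can supply a natural-log probability per held-out token; how those values are obtained from either model is outside this sketch, and the numbers used are made up purely for illustration.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token log-probabilities on the same held-out text.
continuous_model = [math.log(p) for p in (0.30, 0.25, 0.40, 0.20)]
discrete_baseline = [math.log(p) for p in (0.28, 0.22, 0.35, 0.25)]

ppl_continuous = perplexity(continuous_model)
ppl_baseline = perplexity(discrete_baseline)

# Sanity check: a uniform guess over 4 tokens has perplexity 4.
ppl_uniform = perplexity([math.log(0.25)] * 10)
```

Lower perplexity is better, so the test described above would compare `ppl_continuous` against `ppl_baseline` on identical data and tokenisation.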
Original abstract
Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space by embedding discrete tokens into a continuous vector space and applying Gaussian diffusion, and demonstrates its efficacy on several language modelling tasks.
Significance. If the central assumption holds (that Gaussian diffusion on learned token embeddings can recover categorical structure upon decoding without dominant approximation error), this work could significantly advance generative modelling for discrete data by preserving the benefits of continuous diffusion models, such as iterative refinement, for applications like language modelling.
major comments (1)
- [§3.1] The description of the embedding and diffusion process lacks a formal analysis or bound on the discretization error in the final decoding step (e.g., nearest neighbor or softmax), which is load-bearing for the claim that the model remains effectively continuous in input space without the embedding choice dominating performance.
minor comments (1)
- The abstract could more explicitly state the specific language modeling tasks used for demonstration.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. We address the major comment below and indicate the changes we will make to the manuscript.
Point-by-point responses
- Referee: [§3.1] The description of the embedding and diffusion process lacks a formal analysis or bound on the discretization error in the final decoding step (e.g., nearest neighbor or softmax), which is load-bearing for the claim that the model remains effectively continuous in input space without the embedding choice dominating performance.
  Authors: We agree that a formal bound on discretization error would strengthen the presentation. The manuscript currently relies on the fact that the diffusion process itself operates entirely in continuous space and time, with the embedding learned jointly so that the decoder (nearest-neighbor or softmax) recovers coherent categorical sequences, as demonstrated by the language-modeling results. Deriving a tight, non-vacuous bound is non-trivial because the embedding is data-dependent and optimized end-to-end; any such bound would necessarily involve assumptions on the Lipschitz constant of the embedding map and the concentration of the learned representations. In the revised version we will add a short subsection in §3.1 that (i) explicitly states the decoding step as a post-processing operation that does not alter the continuous nature of the forward and reverse processes, (ii) reports empirical reconstruction error (continuous vector to nearest token) on held-out embeddings as a function of embedding dimension, and (iii) discusses why the observed error does not dominate the generative performance relative to purely discrete baselines. We believe this addresses the concern without requiring an intractable theoretical bound.
  Revision: yes
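The measurement proposed in item (ii) could look roughly like the sketch below. It is a toy under stated assumptions: random unit-norm embeddings stand in for learned ones, the "continuous vector" is an embedding perturbed by Gaussian noise of a fixed per-coordinate scale, and the function names and parameter values are hypothetical rather than taken from the manuscript.

```python
import math
import random


def unit_embeddings(vocab_size, dim, rng):
    """Random unit-norm embedding table (a stand-in for learned embeddings)."""
    table = []
    for _ in range(vocab_size):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        table.append([c / norm for c in v])
    return table


def nearest_token(x, table):
    """Id of the embedding closest to x in squared Euclidean distance."""
    def sq_dist(e):
        return sum((a - b) ** 2 for a, b in zip(x, e))
    return min(range(len(table)), key=lambda k: sq_dist(table[k]))


def reconstruction_error(vocab_size, dim, sigma, trials, rng):
    """Fraction of noisy embeddings that decode to the wrong token."""
    table = unit_embeddings(vocab_size, dim, rng)
    errors = 0
    for _ in range(trials):
        token = rng.randrange(vocab_size)
        x = [c + sigma * rng.gauss(0.0, 1.0) for c in table[token]]
        if nearest_token(x, table) != token:
            errors += 1
    return errors / trials


rng = random.Random(0)
error_by_dim = {
    dim: reconstruction_error(vocab_size=32, dim=dim, sigma=0.3,
                              trials=200, rng=rng)
    for dim in (2, 8, 32)
}
```

In this toy setting the error rate is expected to fall as the embedding dimension grows, because random unit vectors become better separated; whether learned embeddings behave the same way is exactly what the promised experiment would need to show.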
Circularity Check
No circularity: derivation chain self-contained with no reductions to fitted inputs or self-citations
The provided abstract and context describe a proposed framework CDCD that embeds categorical tokens into continuous space for Gaussian diffusion, with no equations, fitting procedures, or self-citations presented that would allow any claimed result to reduce to its inputs by construction. The central claim of preserving continuous diffusion benefits for discrete data is stated as a modeling choice and empirical demonstration rather than a derived prediction forced by prior definitions or author-specific uniqueness theorems. No load-bearing steps match the enumerated circularity patterns, as the text contains no explicit parameter fitting renamed as prediction, ansatz smuggling, or renaming of known results. This is the expected honest non-finding for a high-level proposal without detailed derivation visible in the given material.
Forward citations
Cited by 17 Pith papers
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- Discrete Stochastic Localization for Non-autoregressive Generation
  Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.
- Infinite Mask Diffusion for Few-Step Distillation
  Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
- Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
  FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
- LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
  LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
- Flow Map Language Models: One-step Language Modeling via Continuous Denoising
  Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
- Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
  CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
- Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
  Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
- ELF: Embedded Language Flows
  ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
- TextLDM: Language Modeling with Continuous Latent Diffusion
  TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
- Continuous Latent Diffusion Language Model
  Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
- Consistent Diffusion Language Models
  CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
- Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
  Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
  Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
- Dream 7B: Diffusion Large Language Models
  Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...
- Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
  Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
  LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.