pith. machine review for the scientific record. sign in

arxiv: 2211.15089 · v3 · pith:SW3KZRUKnew · submitted 2022-11-28 · 💻 cs.CL · cs.LG

Continuous diffusion for categorical data

Pith reviewed 2026-05-18 03:25 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords diffusion modelscategorical datalanguage modelingcontinuous diffusiongenerative modelsembeddingsCDCD
0
0 comments X

The pith

Categorical data such as language can be generated with diffusion models by first embedding tokens into a continuous vector space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to keep the core strengths of diffusion models—iterative refinement through a continuous process—when moving from perceptual signals like images to discrete data like text. Instead of inventing new discrete diffusion steps, it embeds each categorical token into a vector and then runs ordinary Gaussian diffusion over continuous time in that space. A sympathetic reader would care because this keeps the same smooth denoising trajectory and training dynamics that made diffusion models dominant for continuous domains, while still producing valid categorical outputs at the end. If the approach holds, generative modeling no longer needs to split into separate continuous and discrete families.

Core claim

We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. Discrete tokens are first mapped to continuous vectors; standard Gaussian diffusion is then performed in this embedding space, allowing the same continuous-time refinement procedure used for images and audio to operate on language and other categorical sequences. The resulting models are shown to perform effectively on several language modelling tasks.

What carries the argument

The CDCD framework, which maps categorical tokens to continuous vectors and applies Gaussian diffusion continuously in both time and that vector space.

If this is right

  • Language modeling tasks can be solved by continuous-time iterative refinement rather than discrete token transitions.
  • Standard Gaussian diffusion machinery transfers directly once tokens are embedded, avoiding the need for custom discrete noise schedules.
  • The same framework applies across multiple language modeling benchmarks without changing the underlying diffusion process.
  • Generation remains a gradual denoising trajectory that can be stopped at any continuous time point.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may allow a single diffusion backbone to handle mixed continuous and categorical inputs in multimodal settings.
  • Continuous-time sampling could be tuned more finely than discrete-step schedules, potentially improving speed-quality trade-offs in text generation.
  • Similar embedding-plus-diffusion pipelines might extend naturally to other categorical sequences such as source code or biological strings.

Load-bearing premise

Embedding discrete categorical tokens into a continuous vector space and running standard Gaussian diffusion there preserves the benefits of continuous diffusion without introducing new modeling errors that dominate performance.

What would settle it

Train a CDCD model and a strong discrete-diffusion baseline on the same language dataset and measure generation quality or perplexity; if the continuous-embedding version shows no advantage or clear degradation traceable to the embedding step, the central claim is falsified.

read the original abstract

Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space by embedding discrete tokens into a continuous vector space and applying Gaussian diffusion, and demonstrates its efficacy on several language modelling tasks.

Significance. If the central assumption holds—that Gaussian diffusion on learned token embeddings can recover categorical structure upon decoding without dominant approximation error—this work could significantly advance generative modeling for discrete data by preserving the benefits of continuous diffusion models, such as iterative refinement, for applications like language modeling.

major comments (1)
  1. [§3.1] The description of the embedding and diffusion process lacks a formal analysis or bound on the discretization error in the final decoding step (e.g., nearest neighbor or softmax), which is load-bearing for the claim that the model remains effectively continuous in input space without the embedding choice dominating performance.
minor comments (1)
  1. The abstract could more explicitly state the specific language modeling tasks used for demonstration.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the major comment below and indicate the changes we will make to the manuscript.

read point-by-point responses
  1. Referee: [§3.1] The description of the embedding and diffusion process lacks a formal analysis or bound on the discretization error in the final decoding step (e.g., nearest neighbor or softmax), which is load-bearing for the claim that the model remains effectively continuous in input space without the embedding choice dominating performance.

    Authors: We agree that a formal bound on discretization error would strengthen the presentation. The manuscript currently relies on the fact that the diffusion process itself operates entirely in continuous space and time, with the embedding learned jointly so that the decoder (nearest-neighbor or softmax) recovers coherent categorical sequences, as demonstrated by the language-modeling results. Deriving a tight, non-vacuous bound is non-trivial because the embedding is data-dependent and optimized end-to-end; any such bound would necessarily involve assumptions on the Lipschitz constant of the embedding map and the concentration of the learned representations. In the revised version we will add a short subsection in §3.1 that (i) explicitly states the decoding step as a post-processing operation that does not alter the continuous nature of the forward and reverse processes, (ii) reports empirical reconstruction error (continuous vector to nearest token) on held-out embeddings as a function of embedding dimension, and (iii) discusses why the observed error does not dominate the generative performance relative to purely discrete baselines. We believe this addresses the concern without requiring an intractable theoretical bound. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain self-contained with no reductions to fitted inputs or self-citations

full rationale

The provided abstract and context describe a proposed framework CDCD that embeds categorical tokens into continuous space for Gaussian diffusion, with no equations, fitting procedures, or self-citations presented that would allow any claimed result to reduce to its inputs by construction. The central claim of preserving continuous diffusion benefits for discrete data is stated as a modeling choice and empirical demonstration rather than a derived prediction forced by prior definitions or author-specific uniqueness theorems. No load-bearing steps match the enumerated circularity patterns, as the text contains no explicit parameter fitting renamed as prediction, ansatz smuggling, or renaming of known results. This is the expected honest non-finding for a high-level proposal without detailed derivation visible in the given material.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central claim rests on the unstated assumption that a continuous embedding of discrete tokens can be denoised with standard diffusion dynamics and then mapped back without loss of the original categorical structure.

pith-pipeline@v0.9.0 · 5663 in / 1036 out tokens · 18730 ms · 2026-05-18T03:25:47.881749+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  2. Discrete Stochastic Localization for Non-autoregressive Generation

    cs.LG 2026-05 unverdicted novelty 7.0

    Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.

  3. Infinite Mask Diffusion for Few-Step Distillation

    cs.CL 2026-05 unverdicted novelty 7.0

    Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.

  4. Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast

    cs.CL 2026-05 unverdicted novelty 7.0

    FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.

  5. LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

    cs.CL 2026-04 unverdicted novelty 7.0

    LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.

  6. Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    cs.CL 2026-02 unverdicted novelty 7.0

    Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.

  7. Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

    cs.AI 2025-10 unverdicted novelty 7.0

    CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.

  8. Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

    cs.CL 2026-05 unverdicted novelty 6.0

    Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.

  9. ELF: Embedded Language Flows

    cs.CL 2026-05 unverdicted novelty 6.0

    ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

  10. TextLDM: Language Modeling with Continuous Latent Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.

  11. Continuous Latent Diffusion Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

  12. Consistent Diffusion Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.

  13. Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.

  14. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

  15. Dream 7B: Diffusion Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...

  16. Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

    cs.CL 2025-08 unverdicted novelty 6.0

    Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.

  17. LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    cs.LG 2025-05 conditional novelty 6.0

    LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 17 Pith papers · 17 internal anchors

  1. [1]

    Abadi, P

    M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation ( OSDI 16) , pages 265--283, 2016

  2. [2]

    Austin, D

    J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34: 0 17981--17993, 2021

  3. [3]

    Babuschkin, K

    I. Babuschkin, K. Baumli, A. Bell, S. Bhupatiraju, J. Bruce, P. Buchlovsky, D. Budden, T. Cai, A. Clark, I. Danihelka, C. Fantacci, J. Godwin, C. Jones, T. Hennigan, M. Hessel, S. Kapturowski, T. Keck, I. Kemaev, M. King, L. Martens, V. Mikulik, T. Norman, J. Quan, G. Papamakarios, R. Ring, F. Ruiz, A. Sanchez, R. Schneider, E. Sezener, S. Spencer, S. Sri...

  4. [4]

    Efficient Training of Language Models to Fill in the Middle

    M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022

  5. [5]

    Borsos, R

    Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour. AudioLM : a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143, 2022

  6. [6]

    Bradbury, R

    J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. Vander P las, S. Wanderman- M ilne, and Q. Zhang. JAX : composable transformations of P ython+ N um P y programs, 2018

  7. [7]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  8. [8]

    Campbell, J

    A. Campbell, J. Benton, V. De Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet. A continuous time framework for discrete denoising models. arXiv preprint arXiv:2205.14987, 2022

  9. [9]

    W. Chan, C. Saharia, G. Hinton, M. Norouzi, and N. Jaitly. Imputer: Sequence modelling via imputation and dynamic programming. In International Conference on Machine Learning, pages 1403--1413. PMLR, 2020

  10. [10]

    Chang, H

    H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315--11325, 2022

  11. [11]

    T. Chen, R. Zhang, and G. Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022

  12. [12]

    J. C. Cox, J. E. Ingersoll Jr, and S. A. Ross. A theory of the term structure of interest rates. Econometrica, 2: 0 385--407, 1985

  13. [13]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  14. [14]

    Jukebox: A Generative Model for Music

    P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020

  15. [15]

    Dieleman, C

    S. Dieleman, C. Nash, J. Engel, and K. Simonyan. Variable-rate discrete representation learning. arXiv preprint arXiv:2103.06089, 2021

  16. [16]

    Donahue, I

    C. Donahue, I. Simon, and S. Dieleman. Piano genie. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 160--164, 2019

  17. [17]

    Durkan, A

    C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. Advances in neural information processing systems, 32, 2019

  18. [18]

    Eikema and W

    B. Eikema and W. Aziz. Sampling-based approximations to minimum bayes risk decoding for neural machine translation. arXiv preprint arXiv:2108.04718, 2021

  19. [19]

    Esser, R

    P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873--12883, 2021

  20. [20]

    Ghazvininejad, O

    M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112--6121, 2019

  21. [21]

    Ghazvininejad, O

    M. Ghazvininejad, O. Levy, and L. Zettlemoyer. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785, 2020

  22. [22]

    Goodfellow, J

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014

  23. [23]

    Goyal, C

    K. Goyal, C. Dyer, and T. Berg-Kirkpatrick. Exposing the implicit energy networks behind masked language models via metropolis--hastings. arXiv preprint arXiv:2106.02736, 2021

  24. [24]

    J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017

  25. [25]

    J. Gu, C. Wang, and J. Zhao. Levenshtein transformer. arXiv preprint arXiv:1905.11006, 2019

  26. [26]

    X. Han, S. Kumar, and Y. Tsvetkov. SSD-LM : Semi-autoregressive simplex-based diffusion language model for text generation and modular control, 2022

  27. [27]

    C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del R ' o, M. Wiebe, P. Peterson, P. G ' e rard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array prog...

  28. [28]

    Classifier-Free Diffusion Guidance

    J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  29. [29]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 0 6840--6851, 2020

  30. [30]

    J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022 a

  31. [31]

    J. Ho, T. Salimans, A. A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022 b

  32. [32]

    Holtzman, J

    A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019

  33. [33]

    Hoogeboom, A

    E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021 a

  34. [34]

    Hoogeboom, D

    E. Hoogeboom, D. Nielsen, P. Jaini, P. Forr \'e , and M. Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34: 0 12454--12465, 2021 b

  35. [35]

    X. S. Huang, F. Perez, and M. Volkovs. Improving non-autoregressive translation models without distillation. In International Conference on Learning Representations, 2022

  36. [36]

    J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9 0 (3): 0 90--95, 2007

  37. [37]

    Hyv \"a rinen and P

    A. Hyv \"a rinen and P. Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6 0 (4), 2005

  38. [38]

    Jaegle, F

    A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651--4664. PMLR, 2021

  39. [39]

    Jayaram and J

    V. Jayaram and J. Thickstun. Parallel and flexible sampling from autoregressive models via langevin dynamics. In International Conference on Machine Learning, pages 4807--4818. PMLR, 2021

  40. [40]

    Kaiser, S

    L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pages 2390--2399. PMLR, 2018

  41. [41]

    Elucidating the Design Space of Diffusion-Based Generative Models

    T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022

  42. [42]

    Kasai, J

    J. Kasai, J. Cross, M. Ghazvininejad, and J. Gu. Non-autoregressive machine translation with disentangled context transformer. In International Conference on Machine Learning, pages 5144--5155. PMLR, 2020 a

  43. [43]

    Kasai, N

    J. Kasai, N. Pappas, H. Peng, J. Cross, and N. A. Smith. Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation. arXiv preprint arXiv:2006.10369, 2020 b

  44. [44]

    P. Kidger. O n N eural D ifferential E quations . PhD thesis, University of Oxford, 2021

  45. [45]

    Sequence-Level Knowledge Distillation

    Y. Kim and A. M. Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016

  46. [46]

    Kingma, T

    D. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models. Advances in neural information processing systems, 34: 0 21696--21707, 2021

  47. [47]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015

  48. [48]

    D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018

  49. [49]

    D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  50. [50]

    X. Kong, Z. Zhang, and E. Hovy. Incorporating a local translation mechanism into non-autoregressive translation. arXiv preprint arXiv:2011.06132, 2020

  51. [51]

    T. Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1007

  52. [52]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

    T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018

  53. [53]

    Kumar and W

    S. Kumar and W. Byrne. Minimum B ayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North A merican Chapter of the Association for Computational Linguistics: HLT - NAACL 2004 , pages 169--176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics

  54. [54]

    J. Lee, E. Mansimov, and K. Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901, 2018

  55. [55]

    X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto. Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022

  56. [56]

    C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022 a

  57. [57]

    C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022 b

  58. [58]

    C. Meng, K. Choi, J. Song, and S. Ermon. Concrete score matching: Generalized score matching for discrete data. arXiv preprint arXiv:2211.00802, 2022

  59. [59]

    M \"u ller, B

    T. M \"u ller, B. McWilliams, F. Rousselle, M. Gross, and J. Nov \'a k. Neural importance sampling. ACM Transactions on Graphics (TOG), 38 0 (5): 0 1--19, 2019

  60. [60]

    A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162--8171. PMLR, 2021

  61. [61]

    Papineni, S

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311--318, 2002

  62. [62]

    Perez, F

    E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  63. [63]

    Pillutla, S

    K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. In NeurIPS, 2021

  64. [64]

    M. Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186--191, Belgium, Brussels, Oct. 2018. Association for Computational Linguistics

  65. [65]

    Press and L

    O. Press and L. Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 157--163, Valencia, Spain, Apr. 2017. Association for Computational Linguistics

  66. [66]

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

  67. [67]

    Raffel, N

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019

  68. [68]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  69. [69]

    M. Reid, V. J. Hellendoorn, and G. Neubig. Diffuser: Discrete diffusion via edit-based reconstruction, 2022

  70. [70]

    D. J. Rezende and F. Viola. Generalized elbo with constrained optimization, geco. In Workshop on Bayesian Deep Learning, NeurIPS, 2018

  71. [71]

    D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278--1286. PMLR, 2014

  72. [72]

    P. H. Richemond, S. Dieleman, and A. Doucet. Categorical sdes with simplex diffusion. arXiv preprint arXiv:2210.14784, 2022

  73. [73]

    Rombach, A

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684--10695, 2022

  74. [74]

    Saharia, W

    C. Saharia, W. Chan, S. Saxena, and M. Norouzi. Non-autoregressive machine translation with latent alignments. arXiv preprint arXiv:2004.07437, 2020

  75. [75]

    Saharia, W

    C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1--10, 2022 a

  76. [76]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022 b

  77. [77]

    Savinov, J

    N. Savinov, J. Chung, M. Binkowski, E. Elsen, and A. v. d. Oord. Step-unrolled denoising autoencoders for text generation. arXiv preprint arXiv:2112.06749, 2021

  78. [78]

    A. Shih, D. Sadigh, and S. Ermon. Training and inference on any-order autoregressive models the right way. arXiv preprint arXiv:2205.13554, 2022

  79. [79]

    Sohl-Dickstein, E

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256--2265. PMLR, 2015

  80. [80]

    Song and S

    Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019

Showing first 80 references.