When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation

De Shuai Zhang

arxiv: 2605.15557 · v1 · pith:V7P27EQInew · submitted 2026-05-15 · 💻 cs.CL · cs.LG


Pith reviewed 2026-05-20 19:42 UTC · model grok-4.3


keywords latent geometry · non-autoregressive text generation · decoder recoverability · draft-conditioned refinement · continuous latent models · BERT latents · ROCStories · flow refinement


The pith

Latent geometry alone does not guarantee that generated latents decode to coherent tokens

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in continuous latent approaches to non-autoregressive text generation, matching geometry in latent space does not ensure the decoder can recover sensible tokens. Experiments reveal that latents close to real encoder outputs can still lead to high-entropy or repetitive token distributions. The authors therefore shift focus to draft-conditioned local refinement and argue that models should be judged by decoder recoverability and preservation of decoder-readable structure. This matters for building reliable parallel generation systems that avoid the pitfalls of pure latent-space optimization.

Core claim

The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder-readable structure. Generated latents can be close to real encoder latents but still produce high-entropy, biased, or repetitive token distributions from the decoder.
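The gap between geometric closeness and decodability can be illustrated with a toy example. The sketch below is not the paper's model: the 3-token linear-softmax decoder `W`, the latents `z_a` and `z_b`, and all helper names are hypothetical. It shows two unit-norm latents with identical cosine similarity to a "real" latent that nonetheless decode to very different target-token probabilities, because the toy decoder is far more sensitive along one direction than another.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def decode(W, z):
    # Hypothetical linear decoder: one logit per token, then softmax.
    return softmax([sum(w * x for w, x in zip(row, z)) for row in W])

# Toy decoder (assumption): token 0 reads dimension 0, while tokens 1 and 2
# react strongly to dimension 2 and ignore dimension 1.
W = [
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 12.0],
    [0.0, 0.0, -12.0],
]

z_real = [1.0, 0.0, 0.0]      # stands in for a real encoder latent
s = 0.3
c = math.sqrt(1.0 - s * s)
z_a = [c, s, 0.0]             # deviation along a direction the decoder ignores
z_b = [c, 0.0, s]             # same-size deviation along a sensitive direction

for name, z in [("z_a", z_a), ("z_b", z_b)]:
    p = decode(W, z)
    print(name, "cosine:", round(cosine(z_real, z), 4),
          "p(target):", round(p[0], 3), "entropy:", round(entropy(p), 3))
```

Both latents have the same norm and the same cosine similarity to `z_real`, so a purely geometric metric cannot tell them apart, yet only `z_a` keeps the target token dominant.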

What carries the argument

A draft-conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. It performs controlled local refinement of draft latents rather than full generation from noise.
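As a rough illustration of what "controlled local refinement" can mean, the sketch below takes a few small Euler steps from a draft latent along a velocity field, with a per-dimension (diagonal) scaling. Everything here is an assumption for illustration: `euler_refine`, the toy `toward_target` velocity, the metric values, and the way the diagonal scaling enters the update are not taken from the paper, whose FlowNet and MetricNet are learned networks.

```python
import math

def euler_refine(z, velocity, metric_diag, steps=4, step_size=0.1):
    """Take a few small Euler steps starting from the draft latent z."""
    z = list(z)
    for _ in range(steps):
        v = velocity(z)
        # Diagonal scaling: each dimension moves at its own rate (assumption).
        z = [zi + step_size * mi * vi for zi, mi, vi in zip(z, metric_diag, v)]
    return z

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy stand-in for a learned velocity field: pull toward a fixed clean latent.
target = [1.0, 0.0, -1.0]

def toward_target(z):
    return [t - zi for t, zi in zip(target, z)]

draft = [0.8, 0.3, -0.7]         # hypothetical draft latent
metric_diag = [1.0, 0.5, 2.0]    # hypothetical diagonal metric

refined = euler_refine(draft, toward_target, metric_diag)
before = distance(draft, target)
after = distance(refined, target)
moved = distance(draft, refined)
print("distance before:", round(before, 3), "after:", round(after, 3),
      "moved:", round(moved, 3))
```

With few steps and a small step size the latent moves only part of the way, so the refined point stays near the draft. That is the sense in which the refinement is local rather than a full generation from noise.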

If this is right

Full 768-dimensional BERT latents recover tokens much better than compressed 256-dimensional latents.
DraftPrior target-token probability is 0.938 for clean drafts, 0.613 for 3% dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout.
Local flow refinement and fused decoder-aware readout give modest additional gains over the draft prior.
Metric learning and OT-style alignment improve geometry but do not close the decoder gap.
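To make the reported numbers easier to read, the sketch below computes a target-token probability and the relative drop from the clean-draft value. The metric definition is an assumption (mean probability the decoder assigns to the reference token, averaged over positions); the source does not spell out the exact formula. The `reported` values are the ones quoted above, and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def target_token_probability(logits_per_position, target_ids):
    """Mean probability assigned to the reference token (assumed definition)."""
    probs = [softmax(l)[t] for l, t in zip(logits_per_position, target_ids)]
    return sum(probs) / len(probs)

# Toy example with two positions and a 3-token vocabulary (hypothetical).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_targets = [0, 1]
toy_value = target_token_probability(toy_logits, toy_targets)

# DraftPrior target-token probabilities quoted in the text, keyed by dropout rate.
reported = {0.00: 0.938, 0.03: 0.613, 0.05: 0.483, 0.10: 0.272}
clean = reported[0.00]
relative_drop = {rate: (clean - p) / clean for rate, p in reported.items()}

for rate, drop in relative_drop.items():
    print(f"dropout {rate:.2f}: relative drop {drop:.3f}")
```

Under this reading, 3% token dropout already removes about a third of the clean-draft probability, and 10% removes about 71% of it.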

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of diffusion or flow models for text may need to optimize directly for decoder compatibility in the latent objective rather than rely on post-hoc refinement.
The diagnostic criteria of decoder recoverability could extend to evaluating continuous models for other discrete outputs such as code or structured data.
Joint training of the latent prior with the decoder might eliminate the observed gap between geometric closeness and usable token distributions.

Load-bearing premise

The experiments assume that token recovery probability from the parallel decoder on the ROCStories dataset with controlled draft dropout directly measures whether latent refinement has preserved decoder-readable structure, rather than reflecting other factors such as decoder capacity or dataset bias.

What would settle it

An observation that a method achieving strong latent geometry and scale matching but low decoder token recovery still produces high-quality text when decoded, or that geometric improvements alone close the performance gap without decoder-aware components.


Original abstract

Continuous diffusion and flow models are attractive for non-autoregressive text generation because they can update all positions in parallel. A major difficulty is the interface between continuous latent states and discrete tokens. This report studies a draft-conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. Early Gaussian-start experiments showed that good latent-space metrics, such as scale matching or cosine similarity, do not guarantee good decoding. Generated latents can be close to real encoder latents but still produce high-entropy, biased, or repetitive token distributions. We therefore frame the task as controlled local refinement rather than full generation from noise. On ROCStories, using the first two sentences as prompt and the last three as target, full 768-dimensional BERT latents recover tokens much better than compressed 256-dimensional latents. With 768-dimensional latents, DraftPrior target-token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. Local flow refinement and fused decoder-aware readout give modest additional gains, while metric learning and OT-style alignment improve geometry but do not close the decoder gap. The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder-readable structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows that good latent geometry in continuous text diffusion models often fails to produce recoverable tokens from the decoder, based on dropout experiments on ROCStories.


The main takeaway is that latent geometry alone does not guarantee usable outputs in non-autoregressive text generation. The authors report that latents can match real encoder outputs on scale and cosine metrics yet still produce high-entropy or biased tokens from the parallel decoder. They therefore treat the task as controlled local refinement rather than generation from noise, using a frozen BERT encoder, parallel decoder, DraftPrior, local FlowNet, and learned MetricNet on ROCStories with the first two sentences as prompt and the last three as target. The concrete observations are the sharp drop in target-token probability with draft dropout (0.938 clean, 0.613 at 3%, 0.483 at 5%, 0.272 at 10%) and the large gap between 768-dimensional and 256-dimensional latents. Geometry fixes via MetricNet or OT-style alignment improve the metrics but do not close the decoder gap, while flow refinement adds only modest gains. This leads to their diagnostic claim that evaluation should check decoder recoverability, start distribution quality, and whether refinement preserves decoder-readable structure.

The work does a reasonable job documenting this mismatch with specific numbers and shifting emphasis away from pure latent-space diagnostics. The soft spots are that the evidence lacks error bars, statistical tests, and full training details, and the experiments stay on a single dataset with a fixed decoder. The stress-test concern that the recovery gap may reflect decoder capacity limits rather than a general failure of latent geometry is plausible and not clearly ruled out by the reported results.

This paper is mainly for researchers working on diffusion or flow models for text who might be over-relying on latent metrics. A reader focused on evaluation practices in non-autoregressive generation would find it relevant. It deserves a serious referee because the diagnostic angle could steer the subfield toward more reliable checks. I would recommend sending it to peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that latent geometry alone is insufficient for effective continuous latent text generation in non-autoregressive settings. Using a draft-conditioned refinement model built around a frozen BERT encoder, parallel decoder, DraftPrior, FlowNet, and MetricNet, experiments on ROCStories (first two sentences as prompt, last three as target) show that generated latents can match real ones on scale and cosine metrics yet produce high-entropy or biased token outputs; full 768-dim latents recover tokens far better than 256-dim ones, and target-token probability falls from 0.938 (clean) to 0.272 (10% draft dropout), with geometric or OT-style improvements failing to close the decoder gap. The central diagnostic conclusion is that evaluation must prioritize decoder recoverability, start-distribution quality, and preservation of decoder-readable structure.

Significance. If the empirical observations hold, the work supplies a useful cautionary result for the non-autoregressive generation community: purely geometric objectives in latent space are not guaranteed to yield decodable outputs, and future models should incorporate decoder-aware refinement and evaluation. The concrete dropout-sensitivity numbers and dimension comparison provide falsifiable benchmarks that could steer research away from geometry-only baselines.

major comments (2)

[Abstract] Abstract and experimental results: the reported target-token probabilities (0.938 clean, 0.613 at 3% dropout, 0.483 at 5%, 0.272 at 10%) are presented without error bars, number of evaluation examples, or statistical tests; because these quantities are load-bearing for the claim that latent geometry is insufficient, the absence of uncertainty quantification leaves the strength of the diagnostic result unclear.
[Experiments] Experimental setup: the interpretation that persistent decoder gaps after MetricNet/OT alignment demonstrate an intrinsic limitation of latent geometry assumes the frozen parallel decoder (trained on clean 768-dim BERT latents from ROCStories) is robust to any non-geometric deviation; the manuscript does not report controls that would separate decoder capacity limits from the claimed structural failure.
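The first major comment asks for uncertainty quantification on the target-token probabilities. One common way to attach an interval to a mean over evaluation examples is a percentile bootstrap, sketched below. The per-example values are synthetic placeholders generated with a fixed seed, not results from the paper, and `bootstrap_ci` is a hypothetical helper.

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic per-example target-token probabilities (placeholder data only).
data_rng = random.Random(1)
per_example = [min(1.0, max(0.0, data_rng.gauss(0.6, 0.2))) for _ in range(200)]

mean = sum(per_example) / len(per_example)
lo, hi = bootstrap_ci(per_example)
print(f"mean {mean:.3f}, 95% bootstrap interval [{lo:.3f}, {hi:.3f}] over {len(per_example)} examples")
```

Reporting the interval together with the number of evaluation examples would address both parts of the comment; variation across training seeds is a separate source of uncertainty that a bootstrap over examples does not capture.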

minor comments (1)

[Abstract] The abstract introduces 'fused decoder-aware readout' and 'learned diagonal MetricNet' without a short definition or pointer to their precise formulation, which reduces immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.


Referee: [Abstract] Abstract and experimental results: the reported target-token probabilities (0.938 clean, 0.613 at 3% dropout, 0.483 at 5%, 0.272 at 10%) are presented without error bars, number of evaluation examples, or statistical tests; because these quantities are load-bearing for the claim that latent geometry is insufficient, the absence of uncertainty quantification leaves the strength of the diagnostic result unclear.

Authors: We agree that uncertainty quantification would improve the robustness of these central results. In the revised manuscript we will add error bars computed across multiple random seeds, explicitly state the size of the evaluation set drawn from the ROCStories test split, and include a short note on the statistical significance of the observed probability drops.

Revision: yes

Referee: [Experiments] Experimental setup: the interpretation that persistent decoder gaps after MetricNet/OT alignment demonstrate an intrinsic limitation of latent geometry assumes the frozen parallel decoder (trained on clean 768-dim BERT latents from ROCStories) is robust to any non-geometric deviation; the manuscript does not report controls that would separate decoder capacity limits from the claimed structural failure.

Authors: This observation is fair and highlights a missing control. We will revise the experimental section to add a short discussion and a simple control (small isotropic noise injected into clean latents) that demonstrates the decoder's sensitivity to structural deviations beyond Euclidean or cosine distance. This clarification supports rather than undermines the diagnostic claim, while acknowledging that a fuller ablation of decoder capacity would be desirable in future work.

Revision: partial
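The proposed control can be sketched as a noise sweep: add isotropic Gaussian noise of increasing scale to a clean latent and track the mean target-token probability. The toy decoder `W`, the clean latent, the noise scales, and the helper `noise_sweep` are all hypothetical; a real control would use the paper's frozen decoder and BERT latents.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(W, z):
    # Hypothetical linear decoder: one logit per token, then softmax.
    return softmax([sum(w * x for w, x in zip(row, z)) for row in W])

def noise_sweep(W, z_clean, target_id, sigmas, n_draws=200, seed=0):
    """Mean target-token probability under isotropic Gaussian noise."""
    rng = random.Random(seed)
    results = {}
    for sigma in sigmas:
        total = 0.0
        for _ in range(n_draws):
            z = [x + rng.gauss(0.0, sigma) for x in z_clean]
            total += decode(W, z)[target_id]
        results[sigma] = total / n_draws
    return results

# Toy decoder and clean latent (assumptions for illustration).
W = [
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 12.0],
    [0.0, 0.0, -12.0],
]
z_clean = [1.0, 0.0, 0.0]

results = noise_sweep(W, z_clean, target_id=0, sigmas=[0.0, 0.05, 0.3])
for sigma, p in results.items():
    print(f"sigma {sigma}: mean p(target) {p:.3f}")
```

Comparing this curve with the probabilities obtained from generated latents at a matched distance from the clean latent would indicate whether the decoder gap is explained by distance alone or by the direction and structure of the deviation.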

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical measurements


The paper reports experimental observations on a frozen BERT encoder and parallel decoder using the ROCStories dataset, including specific token recovery probabilities (0.938 clean, 0.613 at 3% dropout, 0.483 at 5%, 0.272 at 10%) and comparisons between 768-dim and 256-dim latents. These quantities are measured outputs from controlled draft dropout and refinement runs, not quantities defined in terms of fitted parameters or reduced by construction to the inputs. No mathematical derivation chain, uniqueness theorem, or self-citation load-bearing step is present; the diagnostic conclusion that latent geometry alone is insufficient follows from observed decoder gaps despite geometric similarity, without any self-referential loops or renaming of known results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The model rests on the domain assumption that a frozen BERT encoder yields latents from which a parallel decoder can recover tokens, plus several architectural choices whose effectiveness is measured empirically rather than derived.

free parameters (2)

latent dimension
768 versus 256 dimensions chosen to test full versus compressed BERT representations; directly affects reported recovery performance.
draft dropout rates
3 percent, 5 percent, and 10 percent dropout levels used to create noisy drafts; values selected to demonstrate performance degradation.

axioms (1)

domain assumption: A frozen BERT encoder produces continuous latents suitable for subsequent refinement and token recovery by a parallel decoder.
Invoked when the paper states that full 768-dimensional BERT latents recover tokens much better than compressed versions.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder-readable structure.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Generated latents can be close to real encoder latents but still produce high-entropy, biased, or repetitive token distributions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

