pith · machine review for the scientific record

arxiv: 2605.09749 · v1 · submitted 2026-05-10 · 💻 cs.AI

Recognition: no theorem link

Primal-Dual Guided Decoding for Constrained Discrete Diffusion

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords discrete diffusion · constrained generation · primal-dual optimization · inference-time guidance · Lagrangian multipliers · KL regularization · token logits · mirror descent

The pith

Primal-dual guided decoding enforces global constraints during discrete diffusion sampling by adding an optimal KL-regularized bias to token logits at each step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that global constraints on generated sequences, such as topical coherence or molecular validity, can be enforced at inference time in discrete diffusion models without retraining. It formulates the problem as a KL-regularized optimization solved online with adaptive Lagrangian multipliers that are updated by mirror descent based on measured violations from partial sequences. This produces an additive bias to the model's token logits that projects the constraint onto the distribution while staying as close as possible to the unconstrained model output. A reader would care because the approach works for multiple constraints at once, requires only standard sampling passes, and supplies formal bounds on how much violation can remain.
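The mechanism can be sketched in a few lines. This is a minimal sketch under stated assumptions: the per-token scorer, the exponentiated-gradient form of the mirror-descent update, and all hyperparameter values are illustrative, not the paper's exact implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def guided_logits(logits, token_scores, lam):
    """Tilt the model's distribution toward the constraint.

    Adding lam * token_scores to the logits realises the KL-regularized
    projection in closed form: the biased distribution q satisfies
    q(j) ∝ p(j) * exp(lam * r(j)), i.e. the distribution closest in KL
    to the unconstrained p among those rewarded for the constraint.
    """
    return logits + lam * token_scores

def update_multiplier(lam, violation, eta=0.5):
    """Exponentiated-gradient (mirror-descent) update on the multiplier.

    A positive measured violation inflates lam, pushing the next
    denoising step harder toward the constraint; zero or negative
    violation relaxes it. (Assumed update form for illustration.)
    """
    return lam * np.exp(eta * violation)

# One denoising step: bias the logits, measure violation, adapt lam.
rng = np.random.default_rng(0)
logits = rng.normal(size=16)             # model's raw token logits
scores = np.zeros(16)
scores[3] = 1.0                          # hypothetical scorer: token 3 satisfies the constraint
lam = 0.5
p_before = softmax(logits)[3]
p_after = softmax(guided_logits(logits, scores, lam))[3]
assert p_after > p_before                # bias moves mass toward the constrained token
lam = update_multiplier(lam, violation=1.0)  # constraint still violated -> strengthen
```

Because the bias is purely additive in logit space, it composes with any sampler and adds no extra model evaluations, which is what makes the method inference-time only.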

Core claim

Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains an open challenge. We propose primal-dual guided decoding, an inference-time method that formulates constrained generation as a KL-regularised optimisation problem and solves it online via adaptive Lagrangian multipliers. At each denoising step, the method modifies token logits through an additive, constraint-dependent bias, with multipliers updated by mirror descent based on constraint violation. The bias arises as the optimal KL-regularised projection of the constraint, so the constrained distribution remains as close as possible to the model's unconstrained distribution while still satisfying the constraint.

What carries the argument

The additive, constraint-dependent bias applied to token logits, obtained as the optimal KL-regularised projection of the constraint and driven by online mirror-descent updates on Lagrangian multipliers.

Load-bearing premise

Constraint violations can be reliably measured and scored from partial sequences at each denoising step, and the mirror-descent updates on multipliers converge to feasible solutions without substantially harming sample quality or diversity.

What would settle it

Measure constraint satisfaction rates and diversity metrics on a fixed model and task (such as molecular property constraints) both with and without the primal-dual logit bias, checking whether the observed violation stays inside the paper's formal bounds.
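That comparison could be run with a small A/B harness along these lines; `sampler` and `scorer` are hypothetical stand-ins for the paper's diffusion sampler and domain scoring function, and the bound check assumes the formal bound is available as a single number.

```python
import numpy as np

def evaluate(sampler, scorer, n=1000, bound=None):
    """Compare constraint satisfaction with and without the logit bias.

    sampler(guided) -> one generated sequence (guided toggles the bias);
    scorer(seq)     -> nonnegative violation (0 means the constraint holds).
    """
    results = {}
    for guided in (False, True):
        v = np.array([scorer(sampler(guided)) for _ in range(n)])
        results[guided] = {
            "pass_rate": float((v == 0).mean()),
            "mean_violation": float(v.mean()),
        }
    if bound is not None:
        # The paper's formal bound should dominate the observed violation.
        results["within_bound"] = results[True]["mean_violation"] <= bound
    return results
```

A diversity metric (e.g. distinct-n over the guided samples) would be computed on the same two sample sets to check that satisfaction gains do not collapse sample variety.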

Figures

Figures reproduced from arXiv: 2605.09749 by Alice Wang, Dmitrii Moor, Federico Tomasi, Mounia Lalmas.

Figure 1: Accumulated-slack multiplier dynamics for a single TinyStories sample.

Figure 2: Constraint difficulty sweep (R ∈ {5, 10, 15, 20}, η = 0.5, N = 1,000). Accumulated slack stays closest to the target and lowest-KL; optimistic and instantaneous saturate pass rate from R = 10 at higher KL. At R = 5 the target is too small to drive λ off its floor and the method barely fires (11.5% pass), whereas at R = 20 the target is large enough that even the conservative ramp satisfies the constraint (99.5% pass, c̄ = 16.9). …

Figure 3: Text Pareto front: unigram KL vs. unconstrained against Pass% (#ocean tokens). …

Figure 4: Logit-increment distribution on TinyStories MDLM. Top row: raw …

Figure 5: Raw backbone log-probabilities log p_θ(x_t^ℓ = j | x_{t+1}) for the top-3 tokens at each of L = 32 positions, across T = 96 denoising steps (one colour per position; transparency encodes rank within the position). Logits evolve smoothly, with no sharp transitions; we plot pre-parameterisation values, bypassing the SUBS rewrite that forces unmasked positions to identity. Different positions cluster at different …

Figure 6: Vocabulary-wide logit evolution for eight representative positions.

Figure 7: Molecular SMILES MDLM: cross-position covariance of centred …

Figure 8: Per-step objective π_t under SPDD ocean steering (L = 128, T = 96, N = 4, R = 5, η = 0.5, λ₀ = 0.5). Top: empirical reward (blue) and Statement 2 lower bound (red, dashed) over denoising steps. Bottom: slack (green = bound holds; red = bound violated) and λ_t trajectory (orange). The bound holds at 96.9% of steps; total empirical reward exceeds the bound by 27%. The dynamics concentrate in the first ∼15 steps; after con…

Figure 9: Example molecules from the 21.9M SMILES backbone.

Figure 10: Per-step dynamics during playlist generation (reggaeton, …
Original abstract

Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains an open challenge. We propose primal-dual guided decoding, an inference-time method that formulates constrained generation as a KL-regularised optimisation problem and solves it online via adaptive Lagrangian multipliers. At each denoising step, the method modifies token logits through an additive, constraint-dependent bias, with multipliers updated by mirror descent based on constraint violation. The bias arises as the optimal KL-regularised projection of the constraint, so the constrained distribution remains as close as possible to the model's unconstrained distribution while still satisfying the constraint. The method requires no retraining and no additional model evaluations beyond standard sampling, supports multiple simultaneous constraints, and provides formal bounds on constraint violation. We evaluate our approach on topical text generation, molecular design, and music playlist generation, showing that a single algorithm instantiated via domain-specific scoring functions improves constraint satisfaction while preserving relevant domain-specific quality metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces primal-dual guided decoding, an inference-time algorithm for constrained discrete diffusion that casts generation as online KL-regularized optimization. At each denoising step it adds an adaptive bias (derived as the optimal projection of the constraint) to the model's logits and updates Lagrangian multipliers via mirror descent on observed violations; the approach requires no retraining, handles multiple constraints simultaneously, supplies formal violation bounds, and is instantiated with domain-specific scorers on topical text, molecular design, and playlist generation tasks.

Significance. If the optimality derivation and bounds hold under the paper's partial-sequence scoring assumptions, the result would be a notable contribution: a training-free, theoretically grounded method for enforcing global constraints inside the diffusion sampling loop that preserves proximity to the base model and generalizes across modalities without extra model calls.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (method derivation): the central claim that the additive bias equals the exact KL-regularized projection of the constraint (and therefore yields formal violation bounds) presupposes that the violation function g(x_t, c) can be evaluated exactly on every intermediate masked sequence x_t. For the three evaluated domains, constraints such as molecular validity and playlist coherence are only well-defined on complete sequences; any heuristic or proxy scorer on partial tokens introduces approximation error that breaks the inner-projection optimality and invalidates the subgradient information used by the mirror-descent multiplier updates.
  2. [§4] §4 (theoretical analysis): the manuscript should state explicitly whether the formal bounds continue to hold when g is replaced by an approximate partial-sequence scorer, and should provide either a proof that the approximation error remains controlled or an empirical quantification of how much the realized violation deviates from the claimed bound.
minor comments (2)
  1. [Abstract] The abstract states that the method 'provides formal bounds on constraint violation' yet does not reference the specific theorem or proposition number; adding the citation would improve readability.
  2. [§5] Implementation details for the domain-specific scoring functions (e.g., how partial-sequence molecular validity is approximated) are mentioned only at a high level; a short appendix table listing the exact proxy functions used would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight an important distinction between the exact theoretical setting and the practical use of proxy scorers. We respond to each major comment below and commit to revisions that clarify assumptions and add supporting analysis.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (method derivation): the central claim that the additive bias equals the exact KL-regularized projection of the constraint (and therefore yields formal violation bounds) presupposes that the violation function g(x_t, c) can be evaluated exactly on every intermediate masked sequence x_t. For the three evaluated domains, constraints such as molecular validity and playlist coherence are only well-defined on complete sequences; any heuristic or proxy scorer on partial tokens introduces approximation error that breaks the inner-projection optimality and invalidates the subgradient information used by the mirror-descent multiplier updates.

    Authors: We agree that the derivation in §3 and the formal bounds rest on the assumption that g(x_t, c) is exactly evaluable on partial sequences. Under this assumption the additive bias is the exact KL-regularized projection and the mirror-descent updates receive exact subgradient information. In the reported experiments we employ domain-specific proxy scorers (partial validity heuristics for molecules, prefix coherence scores for playlists) precisely because the true constraints are only defined on complete sequences. These proxies introduce approximation error, so the strict optimality and the exact subgradient property do not hold in the experimental instantiations. The empirical improvements in constraint satisfaction nevertheless remain, indicating practical utility even under approximation. We will revise the abstract and §3 to state the exact-evaluation assumption explicitly and to note that the reported results rely on proxies. revision: yes

  2. Referee: [§4] §4 (theoretical analysis): the manuscript should state explicitly whether the formal bounds continue to hold when g is replaced by an approximate partial-sequence scorer, and should provide either a proof that the approximation error remains controlled or an empirical quantification of how much the realized violation deviates from the claimed bound.

    Authors: We accept that the formal bounds in §4 are proved only for exact g. With approximate scorers the bounds do not hold in general without further assumptions on the approximation error. We will revise §4 to state this limitation clearly. In addition, we will add an empirical quantification: for each experimental domain we will report the observed final violation values alongside the theoretical bound that would apply under exact g, thereby showing the magnitude of deviation introduced by the proxies. This empirical comparison will be included in the revised §4 and in the experimental tables. revision: yes
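To make the proxy-scorer issue concrete: a hypothetical prefix scorer for a count constraint (e.g., "at least R topical tokens") might look as follows. This illustrates, rather than reproduces, how g could be approximated on partially masked sequences, and where the approximation error enters.

```python
MASK = -1  # assumed sentinel for a still-masked position

def proxy_violation(partial_seq, target_tokens, required_count):
    """Approximate g(x_t, c) on a partially unmasked sequence.

    Counts constraint-satisfying tokens among unmasked positions only.
    Masked positions contribute nothing, so early in denoising this
    proxy overstates the violation -- exactly the approximation error
    the referee flags, since the true constraint is defined only on
    complete sequences.
    """
    count = sum(1 for tok in partial_seq
                if tok != MASK and tok in target_tokens)
    return max(0, required_count - count)
```

On a fully masked sequence the proxy reports the maximal violation `required_count`, which is the systematic bias that any correction or bound extension would have to control.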

Circularity Check

0 steps flagged

No significant circularity; the derivation applies standard optimization primitives.

full rationale

The paper formulates constrained sampling as a KL-regularized projection solved via Lagrangian multipliers and mirror descent, with the additive bias derived directly as the solution to that inner optimization problem at each denoising step. This is an application of external convex-optimization results (KL projection, subgradient updates) to the diffusion process rather than a self-referential definition or a fitted parameter renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked in the abstract or described derivation chain. The central claim therefore remains independent of its own outputs and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on standard optimization assumptions applied to diffusion sampling; no new physical or mathematical entities are introduced.

free parameters (1)
  • mirror descent step size
    Hyperparameter controlling multiplier updates; not specified in abstract but required for the online optimization.
axioms (2)
  • domain assumption Constrained generation at each denoising step can be formulated as a KL-regularized optimization problem whose solution yields an additive logit bias.
    Central modeling choice that enables the primal-dual approach.
  • domain assumption Constraint violations can be evaluated from partially denoised sequences and used to drive mirror-descent updates on Lagrangian multipliers.
    Required for the online adaptive mechanism to function.

pith-pipeline@v0.9.0 · 5463 in / 1442 out tokens · 84617 ms · 2026-05-12T03:01:50.018994+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 2 internal anchors

  1. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex Xijie Lu, Nicolo Fusi, Ava P Amini, and Kevin K Yang. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, 2023. doi:10.1101/2023.09.11.556673.
  3. Jacob Austin, Daniel Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  4. Santiago R Balseiro, Haihao Lu, and Vahab S Mirrokni. Dual mirror descent for online allocation problems. In International Conference on Machine Learning (ICML), 2020.
  5. G. Richard Bickerton, Gaia V. Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L. Hopkins. Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2):90–98, 2012.
  6. Arthur E. Bryson and Yu-Chi Ho. Applied Optimal Control: Optimization, Estimation, and Control. Hemisphere Publishing, Washington, DC, revised edition, 1975.
  7. Michael Cardei, Jacob K Christopher, Bhavya Kailkhura, Thomas Hartvigsen, and Ferdinando Fioretto. Constrained discrete diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
  8. Hyungjin Chung, Jeongsol Kim, Michael T. McCann, Marc L. Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations (ICLR), 2023.
  9. Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  10. Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759, 2023.
  11. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop on Deep Generative Models and Downstream Applications, 2021.
  12. Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1535–1546, 2017.
  13. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations (ICLR), 2020.
  14. John J Irwin, Khanh G Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R Wong, Munkhzul Khurelbaatar, Yurii S Moroz, John Mayfield, and Roger A Sayle. ZINC20 — a free ultralarge-scale chemical database for ligand discovery. Journal of Chemical Information and Modeling, 60(12):6065–6073, 2020.
  15. Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, and Tommi Jaakkola. Torsional diffusion for molecular conformer generation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  16. Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled text generation. In International Conference on Learning Representations (ICLR), 2021.
  17. Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C Frey, Pascal Friederich, Théophile Gaudin, et al. SELFIES and the future of molecular string representations. Patterns, 3(10), 2022.
  18. Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning (ICML), 2024.
  19. Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A Smith, and Yejin Choi. NeuroLogic A*esque decoding: Constrained text generation with lookahead heuristics. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
  20. Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  21. Lucas Maystre, Gabriel Barello, Tudor Berariu, Aleix Cambray, Rares Dolga, Alvaro Ortega Gonzalez, Andrei Nica, and David Barber. Incremental sequence classification with temporal consistency, 2025. URL https://arxiv.org/abs/2505.16548.
  22. Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://arxiv.org/abs/2502.09992. Oral presentation.
  23. Hunter Nisonoff, Junhao Xiong, Stephan Allenspach, and Jennifer Listgarten. Unlocking guidance for discrete state-space diffusion and flow models. In International Conference on Learning Representations (ICLR), 2025.
  24. Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, Artur Kadurin, Simon Johansson, Hongming Chen, Sergey Nikolenko, and Alán Aspuru-Guzik. Molecular Sets (MOSES): A benchmarking platform for molecular generation models. Frontiers in Pharmacology, 11, 2020.
  25. Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1314–1324, 2018.
  26. Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 36:10299–10315, 2023.
  27. Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory (COLT), 2013. URL https://arxiv.org/abs/1208.3728.
  28. Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
  29. Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-Torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. In International Conference on Learning Representations (ICLR), 2025.
  30. Clément Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. DiGress: Discrete denoising diffusion for graph generation. In International Conference on Learning Representations (ICLR), 2023.
  31. Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. Diffusion language models are versatile protein learners. In International Conference on Machine Learning (ICML), 2024. URL https://arxiv.org/abs/2402.18567.
  32. Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.