Primal-Dual Guided Decoding for Constrained Discrete Diffusion
Pith reviewed 2026-05-12 03:01 UTC · model grok-4.3
The pith
Primal-dual guided decoding enforces global constraints during discrete diffusion sampling by adding an optimal KL-regularized bias to token logits at each step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains an open challenge. We propose primal-dual guided decoding, an inference-time method that formulates constrained generation as a KL-regularised optimisation problem and solves it online via adaptive Lagrangian multipliers. At each denoising step, the method modifies token logits through an additive, constraint-dependent bias, with multipliers updated by mirror descent based on constraint violation. The bias arises as the optimal KL-regularised projection of the constraint, so the constrained distribution remains as close as possible to the model's unconstrained distribution while still satisfying the constraint.
What carries the argument
The additive, constraint-dependent bias applied to token logits, obtained as the optimal KL-regularised projection of the constraint and driven by online mirror-descent updates on Lagrangian multipliers.
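The described mechanism is compact enough to sketch. The step below is an illustration of the abstract's description only, not the paper's implementation: the array shapes, the per-token scorer interface, and the multiplicative (exponentiated-gradient) form of the mirror-descent update are all assumptions.

```python
import numpy as np

def guided_step(logits, scores, lam, eta, targets, rng):
    """One illustrative denoising step with a primal-dual logit bias.

    logits:  (seq_len, vocab) unconstrained model logits at this step
    scores:  (n_constraints, seq_len, vocab) proxy scores g_i per candidate
             token, evaluated on the partially masked sequence
    lam:     (n_constraints,) nonnegative Lagrangian multipliers
    eta:     mirror-descent step size (the method's one free parameter)
    targets: (n_constraints,) desired expected score per constraint
    """
    # Primal step: the KL-regularized projection yields
    # q(x) ∝ p(x) · exp(Σ_i λ_i g_i(x)), i.e. an additive bias in logit space.
    biased = logits + np.einsum("c,csv->sv", lam, scores)
    probs = np.exp(biased - biased.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Sample the next tokens from the biased distribution.
    tokens = np.array([rng.choice(len(p), p=p) for p in probs])

    # Dual step: exponentiated-gradient (mirror-descent) update of λ from the
    # observed violation; λ stays nonnegative by construction.
    realized = scores[:, np.arange(len(tokens)), tokens].mean(axis=-1)
    lam = lam * np.exp(eta * (targets - realized))
    return tokens, lam
```

Consistent with the abstract's "no additional model evaluations" claim, nothing in this loop calls the model a second time; the only overhead is the bias addition and the multiplier update.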
Load-bearing premise
Constraint violations can be reliably measured and scored from partial sequences at each denoising step, and the mirror-descent updates on multipliers converge to feasible solutions without substantially harming sample quality or diversity.
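Concretely, the premise is nontrivial because constraints differ in how well they are defined on partial sequences. A toy contrast, with a hypothetical mask id `MASK = -1`:

```python
MASK = -1  # hypothetical mask token id

def max_length_violation(tokens, limit):
    """Exactly computable on partial sequences: the count of revealed
    (non-mask) tokens only grows, so a partial count lower-bounds the
    final violation."""
    return max(0, sum(t != MASK for t in tokens) - limit)

def validity_violation(tokens, is_valid):
    """Only defined on complete sequences: while masks remain, any
    partial-sequence answer is a heuristic proxy, not the true value."""
    if any(t == MASK for t in tokens):
        raise ValueError("validity is undefined until all tokens are revealed")
    return 0 if is_valid(tokens) else 1
```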
What would settle it
Measure constraint satisfaction rates and diversity metrics on a fixed model and task (such as molecular property constraints) both with and without the primal-dual logit bias, checking whether the observed violation stays inside the paper's formal bounds.
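A minimal version of that A/B check might look like the sketch below; `sample_fn` and `constraint` are placeholders for the fixed model's decoder (with the bias toggled) and the task's constraint checker, neither of which is specified here.

```python
import numpy as np

def ab_check(sample_fn, constraint, diversity, n=1000):
    """Compare constraint satisfaction and diversity with the bias on vs. off.

    sample_fn(rng, use_bias) -> one generated sequence (list of token ids)
    constraint(seq) -> bool, True if the completed sequence satisfies it
    diversity(samples) -> float, e.g. distinct-n over the sample set
    """
    report = {}
    for use_bias in (False, True):
        rng = np.random.default_rng(0)  # same seed: isolates the bias effect
        samples = [sample_fn(rng, use_bias) for _ in range(n)]
        report["guided" if use_bias else "baseline"] = {
            "satisfaction": float(np.mean([constraint(s) for s in samples])),
            "diversity": diversity(samples),
        }
    return report
```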
read the original abstract
Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains an open challenge. We propose primal-dual guided decoding, an inference-time method that formulates constrained generation as a KL-regularised optimisation problem and solves it online via adaptive Lagrangian multipliers. At each denoising step, the method modifies token logits through an additive, constraint-dependent bias, with multipliers updated by mirror descent based on constraint violation. The bias arises as the optimal KL-regularised projection of the constraint, so the constrained distribution remains as close as possible to the model's unconstrained distribution while still satisfying the constraint. The method requires no retraining and no additional model evaluations beyond standard sampling, supports multiple simultaneous constraints, and provides formal bounds on constraint violation. We evaluate our approach on topical text generation, molecular design, and music playlist generation, showing that a single algorithm instantiated via domain-specific scoring functions improves constraint satisfaction while preserving relevant domain-specific quality metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces primal-dual guided decoding, an inference-time algorithm for constrained discrete diffusion that casts generation as online KL-regularized optimization. At each denoising step it adds an adaptive bias (derived as the optimal projection of the constraint) to the model's logits and updates Lagrangian multipliers via mirror descent on observed violations; the approach requires no retraining, handles multiple constraints simultaneously, supplies formal violation bounds, and is instantiated with domain-specific scorers on topical text, molecular design, and playlist generation tasks.
Significance. If the optimality derivation and bounds hold under the paper's partial-sequence scoring assumptions, the result would be a notable contribution: a training-free, theoretically grounded method for enforcing global constraints inside the diffusion sampling loop that preserves proximity to the base model and generalizes across modalities without extra model calls.
major comments (2)
- [Abstract / §3] Abstract and §3 (method derivation): the central claim that the additive bias equals the exact KL-regularized projection of the constraint (and therefore yields formal violation bounds) presupposes that the violation function g(x_t, c) can be evaluated exactly on every intermediate masked sequence x_t. For the three evaluated domains, constraints such as molecular validity and playlist coherence are only well-defined on complete sequences; any heuristic or proxy scorer on partial tokens introduces approximation error that breaks the inner-projection optimality and invalidates the subgradient information used by the mirror-descent multiplier updates.
- [§4] §4 (theoretical analysis): the manuscript should state explicitly whether the formal bounds continue to hold when g is replaced by an approximate partial-sequence scorer, and should provide either a proof that the approximation error remains controlled or an empirical quantification of how much the realized violation deviates from the claimed bound.
minor comments (2)
- [Abstract] The abstract states that the method 'provides formal bounds on constraint violation' yet does not reference the specific theorem or proposition number; adding the citation would improve readability.
- [§5] Implementation details for the domain-specific scoring functions (e.g., how partial-sequence molecular validity is approximated) are mentioned only at a high level; a short appendix table listing the exact proxy functions used would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight an important distinction between the exact theoretical setting and the practical use of proxy scorers. We respond to each major comment below and commit to revisions that clarify assumptions and add supporting analysis.
read point-by-point responses
-
Referee: [Abstract / §3] Abstract and §3 (method derivation): the central claim that the additive bias equals the exact KL-regularized projection of the constraint (and therefore yields formal violation bounds) presupposes that the violation function g(x_t, c) can be evaluated exactly on every intermediate masked sequence x_t. For the three evaluated domains, constraints such as molecular validity and playlist coherence are only well-defined on complete sequences; any heuristic or proxy scorer on partial tokens introduces approximation error that breaks the inner-projection optimality and invalidates the subgradient information used by the mirror-descent multiplier updates.
Authors: We agree that the derivation in §3 and the formal bounds rest on the assumption that g(x_t, c) is exactly evaluable on partial sequences. Under this assumption the additive bias is the exact KL-regularized projection and the mirror-descent updates receive exact subgradient information. In the reported experiments we employ domain-specific proxy scorers (partial validity heuristics for molecules, prefix coherence scores for playlists) precisely because the true constraints are only defined on complete sequences. These proxies introduce approximation error, so the strict optimality and the exact subgradient property do not hold in the experimental instantiations. The empirical improvements in constraint satisfaction nevertheless remain, indicating practical utility even under approximation. We will revise the abstract and §3 to state the exact-evaluation assumption explicitly and to note that the reported results rely on proxies. revision: yes
-
Referee: [§4] §4 (theoretical analysis): the manuscript should state explicitly whether the formal bounds continue to hold when g is replaced by an approximate partial-sequence scorer, and should provide either a proof that the approximation error remains controlled or an empirical quantification of how much the realized violation deviates from the claimed bound.
Authors: We accept that the formal bounds in §4 are proved only for exact g. With approximate scorers the bounds do not hold in general without further assumptions on the approximation error. We will revise §4 to state this limitation clearly. In addition, we will add an empirical quantification: for each experimental domain we will report the observed final violation values alongside the theoretical bound that would apply under exact g, thereby showing the magnitude of deviation introduced by the proxies. This empirical comparison will be included in the revised §4 and in the experimental tables. revision: yes
Circularity Check
No significant circularity; derivation applies standard optimization primitives
full rationale
The paper formulates constrained sampling as a KL-regularized projection solved via Lagrangian multipliers and mirror descent, with the additive bias derived directly as the solution to that inner optimization problem at each denoising step. This is an application of external convex-optimization results (KL projection, subgradient updates) to the diffusion process rather than a self-referential definition or a fitted parameter renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked in the abstract or described derivation chain. The central claim therefore remains independent of its own outputs and does not reduce to its inputs by construction.
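For readers who want the one-line version of that projection argument, the standard result (stated here from general convex duality, not from the paper's own proofs) is:

```latex
% Inner problem at a fixed denoising step, for fixed multipliers \lambda_i \ge 0:
%   \max_{q}\; \textstyle\sum_i \lambda_i\, \mathbb{E}_{q}[g_i(x)] \;-\; \mathrm{KL}\big(q \,\|\, p\big)
% Stationarity plus normalization gives the Gibbs / exponential-tilt form
q^\star(x) \;=\; \frac{p(x)\, \exp\big(\sum_i \lambda_i\, g_i(x)\big)}
                      {\sum_{x'} p(x')\, \exp\big(\sum_i \lambda_i\, g_i(x')\big)}
% i.e. the constrained distribution differs from the model's p only by the
% additive logit bias \sum_i \lambda_i\, g_i(x).
```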
Axiom & Free-Parameter Ledger
free parameters (1)
- mirror-descent step size
axioms (2)
- domain assumption: Constrained generation at each denoising step can be formulated as a KL-regularized optimization problem whose solution yields an additive logit bias.
- domain assumption: Constraint violations can be evaluated from partially denoised sequences and used to drive mirror-descent updates on Lagrangian multipliers.
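The second axiom implicitly fixes the shape of the dual update. One standard instantiation (entropic mirror descent on the nonnegative orthant, consistent with the sketch above but not confirmed as the paper's exact rule) is:

```latex
% Dual update at step t, with step size \eta > 0 and observed violation
% v_i^{(t)} = c_i - \mathbb{E}_{q_t}[g_i(x)] for constraint i:
\lambda_i^{(t+1)} \;=\; \lambda_i^{(t)} \exp\!\big(\eta\, v_i^{(t)}\big)
% Entropic mirror descent keeps \lambda \ge 0 automatically and reduces to
% multiplicative weights on the multipliers.
```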
Reference graph
Works this paper leans on
- [1] Jacob Austin, Daniel Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- [2] Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [3] Lucas Maystre, Gabriel Barello, Tudor Berariu, Aleix Cambray, Rares Dolga, Alvaro Ortega Gonzalez, Andrei Nica, and David Barber. Incremental sequence classification with temporal consistency, 2025. URL https://arxiv.org/abs/2505.16548.
- [4] Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C Frey, Pascal Friederich, Théophile Gaudin, et al. SELFIES and the future of molecular string representations. Patterns, 3(10), 2022.
- [5] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [6] G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2):90–98, 2012.
- [7] John J Irwin, Khanh G Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R Wong, Munkhzul Khurelbaatar, Yurii S Moroz, John Mayfield, and Roger A Sayle. ZINC20 – a free ultralarge-scale chemical database for ligand discovery. Journal of Chemical Information and Modeling, 60(12):6065–6073, 2020.
- [8] Santiago R Balseiro, Haihao Lu, and Vahab S Mirrokni. Dual mirror descent for online allocation problems. In International Conference on Machine Learning (ICML), 2020.
- [9] Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1535–1546, 2017.
- [10] Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1314–1324, 2018.
- [11] Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A Smith, and Yejin Choi. NeuroLogic A*esque decoding: Constrained text generation with lookahead heuristics. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
- [12] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- [13] Hunter Nisonoff, Junhao Xiong, Stephan Allenspach, and Jennifer Listgarten. Unlocking guidance for discrete state-space diffusion and flow models. In International Conference on Learning Representations (ICLR), 2025.
- [14] Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-Torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. In International Conference on Learning Representations (ICLR), 2025.
- [15] Michael Cardei, Jacob K Christopher, Bhavya Kailkhura, Thomas Hartvigsen, and Ferdinando Fioretto. Constrained discrete diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [16] Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759, 2023.
- [17] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning (ICML), 2024.
- [18] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop on Deep Generative Models and Downstream Applications, 2021.
- [19] Hyungjin Chung, Jeongsol Kim, Michael T McCann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations (ICLR), 2023.
- [20] Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
- [21] Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, and Tommi Jaakkola. Torsional diffusion for molecular conformer generation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [22] Clément Vignac, Igor Krawczuk, Antoine Siraudin, Bohang Wang, Volkan Cevher, and Pascal Frossard. DiGress: Discrete denoising diffusion for graph generation. In International Conference on Learning Representations (ICLR), 2023.
- [23] Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, Artur Kadurin, Simon Johansson, Hongming Chen, Sergey Nikolenko, and Alán Aspuru-Guzik. Molecular Sets (MOSES): A benchmarking platform for molecular generation models. Frontiers in Pharmacology, 11, 2020.
- [24] Arthur E. Bryson and Yu-Chi Ho. Applied Optimal Control: Optimization, Estimation, and Control. Hemisphere Publishing, Washington, DC, revised edition, 1975.
- [25] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory (COLT), 2013. URL https://arxiv.org/abs/1208.3728.
- [26] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://arxiv.org/abs/2502.09992. Oral presentation.
- [27] Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. Diffusion language models are versatile protein learners. In International Conference on Machine Learning (ICML), 2024. URL https://arxiv.org/abs/2402.18567.
- [28] Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex Xijie Lu, Nicolo Fusi, Ava P Amini, and Kevin K Yang. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, 2023. doi:10.1101/2023.09.11.556673.
- [29] Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled text generation. In International Conference on Learning Representations (ICLR), 2021.
- [30] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations (ICLR), 2020.
- [31] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [32] Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 36:10299–10315, 2023.