
arxiv: 2606.28301 · v1 · pith:GRFS4NT7new · submitted 2026-06-26 · 💻 cs.LG · cs.DS · cs.NA · math.NA · math.PR · stat.ML

VGB for Masked Diffusion Model: Efficient Test-time Scaling for Reward Satisfaction and Sample Editing

Pith reviewed 2026-06-29 04:01 UTC · model grok-4.3

classification 💻 cs.LG · cs.DS · cs.NA · math.NA · math.PR · stat.ML
keywords masked diffusion models · test-time scaling · reward-guided sampling · backtracking Markov chain · constraint satisfaction · sample editing · quadratic complexity · verifier robustness

The pith

MDM-VGB extends backtracking Markov chains to masked-state graphs for quadratic-complexity reward-guided sampling in diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MDM-VGB as a sampler for masked diffusion models that adds reward-guided remasking to standard unmasking steps. It adapts the classical Jerrum-Sinclair backtracking random walk so the chain operates on the full graph of partially masked sequences rather than a fixed prefix tree. The reward tilts the walk toward higher-value partial states, which supports both generating new high-reward outputs and repairing existing low-reward ones. The construction is proved to keep quadratic complexity and to stay robust when the verifier that supplies the reward is noisy. This matters for applying diffusion models to tasks that impose structural constraints or downstream objectives where plain sampling is inefficient.

Core claim

MDM-VGB extends the Jerrum-Sinclair backtracking Markov chain from a fixed prefix tree to an arbitrary masked-state graph and tilts the walk with the reward, so that unmasking and remasking moves favor higher-value partial configurations. The resulting sampler achieves quadratic complexity, remains robust to process-verifier noise, and enables both high-reward generation and efficient repair of low-reward samples, whereas best-of-N incurs exponential complexity from error accumulation.

What carries the argument

Reward-tilted backtracking random walk on the masked-state graph, which permits unmasking and remasking at arbitrary token positions.
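The mechanism above can be illustrated with a toy walk. This is a minimal sketch under stated assumptions, not the paper's sampler: the task (no two adjacent tokens may be equal), the function names, the `eps` floor, and the rule that each edge is weighted by the value of its deeper endpoint are all hypothetical choices made for illustration. It only shows what "unmask or remask at arbitrary positions, with moves weighted by a process verifier" can look like in code.

```python
import random

MASK = None  # a masked position


def neighbors(z, vocab):
    """Return (forward children, backward parents) of masked state z.

    Forward: reveal one masked position with any token.
    Backward: re-mask any one revealed position (not only the latest one).
    """
    forward = [z[:i] + (a,) + z[i + 1:]
               for i, t in enumerate(z) if t is MASK for a in vocab]
    backward = [z[:i] + (MASK,) + z[i + 1:]
                for i, t in enumerate(z) if t is not MASK]
    return forward, backward


def value(z, eps=0.05):
    """Hypothetical process verifier: 1.0 if no two adjacent revealed
    tokens are equal, otherwise a small positive floor eps."""
    for a, b in zip(z, z[1:]):
        if a is not MASK and a == b:
            return eps
    return 1.0


def step(z, vocab, rng):
    """One move of a weighted random walk. Assumption: the edge between a
    state and its child carries the value of the deeper endpoint."""
    forward, backward = neighbors(z, vocab)
    weights = [value(c) for c in forward] + [value(z)] * len(backward)
    return rng.choices(forward + backward, weights=weights, k=1)[0]


def walk(n, vocab, max_steps, seed=0):
    """Walk from the fully masked state until a fully revealed,
    constraint-satisfying state is reached, or give up."""
    rng = random.Random(seed)
    z = (MASK,) * n
    for _ in range(max_steps):
        if MASK not in z and value(z) == 1.0:
            return z
        z = step(z, vocab, rng)
    return None


if __name__ == "__main__":
    print(walk(4, (0, 1), max_steps=20000))
```

Because a violating state has low value, remasking out of it is cheap relative to staying deep in a bad region, which is the intuition behind repair by backtracking. Starting `walk` from a fully revealed low-reward tuple instead of the all-masked state would give the editing use case.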

If this is right

  • High-reward samples can be produced with quadratic rather than exponential cost relative to the number of tokens.
  • Low-reward outputs can be repaired by selectively remasking and re-unmasking positions.
  • The sampler stays effective even when the process verifier that guides the reward is noisy.
  • Performance advantages appear on constraint-satisfaction benchmarks such as Sudoku and on molecular property tasks such as QM9.
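To make the quadratic-versus-exponential contrast concrete, here is a small arithmetic sketch. It rests on a simplifying assumption that is not taken from the paper: each of the n tokens is wrong independently with probability `eps`, so a whole sample is acceptable with probability (1 - eps)^n. The constant `c` in the quadratic budget is likewise hypothetical.

```python
def best_of_n_expected_draws(n_tokens, eps):
    """Expected number of independent samples until one is fully correct,
    assuming each token errs independently with probability eps."""
    return (1.0 - eps) ** (-n_tokens)


def quadratic_budget(n_tokens, c=1.0):
    """A budget that grows quadratically in sequence length
    (c is an illustrative constant, not a value from the paper)."""
    return c * n_tokens ** 2


if __name__ == "__main__":
    for n in (10, 50, 100, 200):
        print(n, best_of_n_expected_draws(n, eps=0.1), quadratic_budget(n))
```

Under this toy model the expected number of draws multiplies by a fixed factor for every added token, while the quadratic budget only quadruples when the length doubles.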

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The masked-state graph construction may transfer to other discrete generative models that expose partial states during sampling.
  • Quadratic scaling could allow hybrid training-plus-inference loops that optimize non-differentiable rewards without full retraining.
  • If the mixing guarantees extend to other reward functions, similar backtracking could improve test-time methods beyond diffusion.

Load-bearing premise

Extending the Jerrum-Sinclair backtracking Markov chain from a fixed prefix tree to an arbitrary masked-state graph preserves its mixing and robustness properties when the reward is used to tilt the walk.

What would settle it

An experiment on Sudoku or QM9 showing that the number of steps needed for MDM-VGB to reach a target reward level grows exponentially with sequence length under increasing verifier noise.
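One way such an experiment could be read off, sketched under assumptions: given measured step counts at several sequence lengths, fit a straight line to log(steps) against log(n) (polynomial growth) and against n (exponential growth), then compare the fits. The function names and the use of R² as the comparison criterion are illustrative choices, not a procedure from the paper.

```python
import math


def linfit_r2(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, R^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


def growth_diagnosis(lengths, steps):
    """Compare a power-law fit (log-log) with an exponential fit (semi-log)."""
    log_steps = [math.log(s) for s in steps]
    poly_slope, poly_r2 = linfit_r2([math.log(n) for n in lengths], log_steps)
    exp_slope, exp_r2 = linfit_r2(list(lengths), log_steps)
    return {"poly_exponent": poly_slope, "poly_r2": poly_r2,
            "exp_rate": exp_slope, "exp_r2": exp_r2}


if __name__ == "__main__":
    lengths = [8, 16, 32, 64]
    print(growth_diagnosis(lengths, [n ** 2 for n in lengths]))
```

A fitted `poly_exponent` near 2 with the better log-log fit would be consistent with the quadratic claim; a clearly better semi-log fit at higher verifier noise would point the other way. With only a handful of lengths the two fits can be hard to tell apart, so a wide range of n matters.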

Figures

Figures reproduced from arXiv: 2606.28301 by Kijung Jeon, Molei Tao, Thuy-Duong Vuong.

Figure 1. Visualization of MDM-VGB for generation and editing. The depth of a state is the number …
Figure 2. Masked-state graph vs. prefix tree. MDM-VGB operates on the any-order masked-state graph, which allows arbitrary-coordinate re-masking. AR-VGB operates on a fixed-order prefix tree, where backtracking only removes the latest revealed token.
Masked states graph. Consider a masked state $z \in Z$. For $B \subseteq [n] \setminus R(z)$ and $a_B \in V^B$, we call $z^{B \to a_B}$ a forward child of $z$; let $C(z)$ be the set of children of $z$. For $B \subseteq R$…
Figure 3. MDM-VGB and MDM-VGB-Momentum. MDM-VGB alternates value-guided reveal and …
Figure 4. Generation quality–cost frontiers for Sudoku, QM9, DNA, and Protein. Solid curves …
Figure 5. QM9 unique Pass@95: fraction of distinct generated molecules satisfying Pass@95 versus per-sample compute
Figure 6. Editing on QM9, DNA, and Protein. Initial configurations are grouped by seed reward …
Figure 7. MDM-VGB can edit an early mistake directly, while AR-VGB must erase the suffix. The red token marks the local error, and gray tokens denote positions selected for re-masking.

Method      AR Acc. ↑   AR Cost ↓   MDM Acc. ↑   MDM Cost ↓
VGB         25.64%      127.63      99.41%       29.98
+Momentum   80.92%      75.50       97.54%       26.24
Figure 8. Dyck grammar editing with varying re-masking strength λ. Moderate re-masking strength improves editing accuracy while reducing the average number of moves.
We highlight two main ablations: one on the re-masking parameter λ, which controls how strongly re-masking decisions are guided by reward values, and one on verifier model size. Additional ablations, including block size and shortlisting rule, are pr…
Figure 9. DNA verifier size tradeoff. Larger verifiers improve Pass@95 at the cost of additional verifier inference.
Verifier size and amortized cost. We ablate the size of the learned DNA verifier in …
Figure 10. Roadmap of the VGB variants studied in the appendix. AR-VGB and its momentum variant …
Figure 11. Illustration of one geometric balanced AOAR-VGB transition from a masked state …
Figure 12. Flow cancellation at state z. For a fixed masked state z, the two lifted copies are (z, ↓) and (z, ↑). If both momentum modes assign positive probability to switching between these two copies, the smaller of the two switch probabilities is merely a symmetric back-and-forth exchange. We remove this common exchange and turn it into same-copy holding probability, leaving only the residual imbalance as an a…
Figure 13. Illustration of the momentum lift for geometric balanced AOAR-VGB. Forward moves …
Figure 14. Illustration of one geometric balanced MDM-VGB transition. Forward moves reveal an …
Figure 15. Illustration of the flow-cancelled momentum lift for geometric MDM-VGB. The downward …
Figure 16. Shortlisting-budget ablation for MDM-VGB-MOMENTUM on QM9 and DNA. Increasing $L_f = L_b$ helps up to a moderate budget, after which quality saturates while adjusted NFE can continue to grow.
Figure 17. QM9 root-start MDM-VGB-MOMENTUM block-size ablation under a maximum budget corresponding to $N = 16$, with $L_f = L_b = 4$, $K = 4$, and $\lambda = 4$. The left panel reports adjusted NFE, including verifier FLOPs, and the right panel reports Pass@95. Moderate block sizes improve the compute–quality tradeoff, whereas overly large blocks increase cost and degrade reward satisfaction.
Original abstract

Inference-time scaling is a promising paradigm to improve generative models, especially when outputs must satisfy structural constraints or optimize downstream rewards. We consider Masked Diffusion Model (MDM) and introduce MDM-VGB, a discrete diffusion sampler that augments unmasking generation with theoretically principled reward-guided remasking. Inspired by the recent success of the classical Jerrum-Sinclair backtracking Markov chain in reward-tilted generation, MDM-VGB extends the backtracking random walk from a fixed prefix tree to a masked-state graph, allowing tokens to be unmasked and remasked at arbitrary positions. The resulting sampler favors unmasking and remasking moves that lead to higher-value partial configurations, enabling both effective high-reward generation and efficient repair of low-reward samples. We prove that MDM-VGB is robust to process-verifier noise and achieves quadratic complexity, while popular test-time heuristics such as best-of-$N$ can incur exponential complexity due to error accumulation. Our theoretical findings are corroborated by strong empirical performance, particularly on popular constraint-satisfaction and scientific benchmarks such as Sudoku and QM9.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MDM-VGB, a discrete diffusion sampler for Masked Diffusion Models that augments standard unmasking with reward-guided remasking by extending the Jerrum-Sinclair backtracking Markov chain from a prefix tree to a general masked-state graph. It claims to prove that MDM-VGB is robust to process-verifier noise and achieves quadratic complexity (in contrast to exponential complexity for best-of-N due to error accumulation), while also enabling sample editing; these claims are supported by empirical results on constraint-satisfaction tasks such as Sudoku and molecular benchmarks such as QM9.

Significance. If the central theoretical claims hold, the work would provide a principled, polynomial-time test-time scaling method for reward-driven generation and editing in masked diffusion models, addressing a key limitation of heuristic approaches and offering potential advantages for structured generation tasks in ML and scientific domains.

major comments (2)
  1. [Abstract / Theoretical analysis] Abstract and theoretical analysis section: the claim that the extension of the Jerrum-Sinclair backtracking chain to the masked-state graph preserves quadratic mixing time and noise robustness under reward tilting is asserted without an explicit conductance bound, coupling argument, or stationary-distribution analysis that accounts for the cycles and position-dependent connectivity absent from the original tree case; if remasking moves create low-conductance cuts for non-monotonic rewards, both the quadratic-complexity guarantee and the exponential-vs-quadratic comparison to best-of-N would not follow.
  2. [Empirical evaluation] Empirical evaluation section: the abstract states that theoretical findings are 'corroborated by strong empirical performance' on Sudoku and QM9, yet no specific quantitative results, baselines (including best-of-N variants), metrics, or statistical details are provided in the visible summary; without these, the empirical support for the complexity and robustness claims cannot be assessed.
minor comments (1)
  1. Notation for the masked-state graph and reward-tilted transitions should be defined more explicitly (e.g., transition probabilities and stationary distribution) to aid readability of the theoretical claims.
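
For readers less familiar with the quantities the first major comment asks for, the standard definitions can be written out as follows. This is textbook material for a lazy, reversible chain with transition matrix $P$ and stationary distribution $\pi$ (see, e.g., Levin and Peres, Markov Chains and Mixing Times). The symbols are generic rather than the paper's notation, and nothing here asserts that MDM-VGB satisfies such a bound; it only shows what an explicit conductance argument would need to control.

```latex
% Edge measure and bottleneck ratio of a set S of masked states
Q(S, S^c) = \sum_{x \in S,\; y \notin S} \pi(x)\, P(x, y),
\qquad
\Phi(S) = \frac{Q(S, S^c)}{\pi(S)},
\qquad
\Phi_\star = \min_{S \,:\, 0 < \pi(S) \le 1/2} \Phi(S).

% Cheeger-type inequality for the spectral gap \gamma of a reversible chain
\frac{\Phi_\star^2}{2} \;\le\; \gamma \;\le\; 2\, \Phi_\star.

% Consequence for a lazy reversible chain, with \pi_{\min} = \min_x \pi(x)
t_{\mathrm{mix}}(\varepsilon)
\;\le\; \frac{1}{\gamma} \log\frac{1}{\varepsilon\, \pi_{\min}}
\;\le\; \frac{2}{\Phi_\star^2} \log\frac{1}{\varepsilon\, \pi_{\min}}.
```

A low-conductance cut, in the sense of the comment, is a set $S$ with small $\Phi(S)$; by the upper bound $\gamma \le 2\Phi_\star$, such a set forces a small spectral gap and therefore slow mixing.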

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating where revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Theoretical analysis] Abstract and theoretical analysis section: the claim that the extension of the Jerrum-Sinclair backtracking chain to the masked-state graph preserves quadratic mixing time and noise robustness under reward tilting is asserted without an explicit conductance bound, coupling argument, or stationary-distribution analysis that accounts for the cycles and position-dependent connectivity absent from the original tree case; if remasking moves create low-conductance cuts for non-monotonic rewards, both the quadratic-complexity guarantee and the exponential-vs-quadratic comparison to best-of-N would not follow.

    Authors: We agree that the current theoretical section asserts preservation of quadratic mixing time and noise robustness for the masked-state graph extension but does not supply an explicit conductance bound or coupling argument that fully treats cycles and position-dependent connectivity. The manuscript sketches the graph extension from the tree case but leaves the detailed stationary-distribution analysis implicit. We will add the requested conductance bound, coupling argument, and analysis of potential low-conductance cuts under non-monotonic rewards in the revised theoretical section to substantiate the quadratic-complexity claim and the comparison to best-of-N. revision: yes

  2. Referee: [Empirical evaluation] Empirical evaluation section: the abstract states that theoretical findings are 'corroborated by strong empirical performance' on Sudoku and QM9, yet no specific quantitative results, baselines (including best-of-N variants), metrics, or statistical details are provided in the visible summary; without these, the empirical support for the complexity and robustness claims cannot be assessed.

    Authors: The full empirical evaluation section of the manuscript reports concrete quantitative results on Sudoku (success rates and scaling curves) and QM9 (property scores), with direct comparisons to best-of-N baselines at multiple N values, along with metrics and statistical details that support the quadratic scaling and noise-robustness claims. The abstract summarizes these findings at a high level. Because the referee refers only to the 'visible summary', we infer that the body was not reviewed; we will add a concise summary of key quantitative results to the abstract for completeness. revision: partial

Circularity Check

0 steps flagged

No circularity: theoretical claims rest on external classical MCMC result plus claimed extension proof

full rationale

The abstract states that MDM-VGB extends the classical Jerrum-Sinclair backtracking Markov chain (an external 1989 result) to a masked-state graph and proves quadratic complexity plus noise robustness. No equations, fitted parameters, or self-citations are quoted that reduce the claimed properties to inputs by construction. The derivation is presented as self-contained via the extension argument, with no load-bearing self-citation chain or renaming of known results visible.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the masked diffusion process can be modeled as a Markov chain on masked states whose transitions can be tilted by rewards while preserving polynomial mixing; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: The Jerrum-Sinclair backtracking chain properties extend to the masked-state graph under reward tilting.
    Invoked when claiming quadratic complexity and noise robustness for the new sampler.
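
To make "tilted by rewards" concrete, one common way to write a reward-tilted target is sketched below. The symbols are assumptions chosen for illustration and are not confirmed as the paper's definitions: $\pi_{\mathrm{ref}}$ is the base masked diffusion sampler, $\tau \ge 0$ is a reward weight, $x$ is the prompt or condition, $y$ is a full sequence, and $\mathrm{Comp}(z)$ is the set of full sequences that agree with the partial state $z$ on its revealed positions.

```latex
% Reward-tilted target over full sequences y
\pi^\star(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\, \tau(x, y)}{Z(x)},
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \tau(x, y).

% Ideal value of a partial (masked) state z: tilted mass of its completions
U^\star(x, z) = \sum_{y \in \mathrm{Comp}(z)} \pi_{\mathrm{ref}}(y \mid x)\, \tau(x, y),
\qquad
U^\star(x, \varnothing) = Z(x).
```

In this reading, a process verifier is an approximation of $U^\star$ on partial states, and the axiom amounts to assuming that a walk guided by such an approximation still mixes in polynomial time on the masked-state graph.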

pith-pipeline@v0.9.1-grok · 5747 in / 1309 out tokens · 42524 ms · 2026-06-29T04:01:59.891768+00:00 · methodology

