DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

Andi Han; Atsushi Nitanda; Dake Bu; Hau-San Wong; Qingfu Zhang; Taiji Suzuki; Wei Huang

arxiv: 2604.24357 · v2 · pith:LWK7VBFAnew · submitted 2026-04-27 · 💻 cs.LG · cs.AI

Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda

Pith reviewed 2026-05-08 04:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords diffusion language models · token ordering · Doob h-transform · process reward model · plug-in module · masked diffusion · generative modeling

The pith

DPRM introduces a plug-in module that shifts token ordering in diffusion language models from confidence rules to Doob h-transform process reward guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that diffusion language models can improve generation quality by treating token reveal order as a controllable policy rather than a fixed or random choice. It shows that starting with confidence-based ordering and gradually incorporating Doob h-transform guided rewards, while leaving the underlying model and training objective untouched, produces better results than baselines. A sympathetic reader would care because this targets train-test mismatch and myopic decisions without requiring new architectures or loss functions. If the approach holds, ordering policy becomes a separable lever that can be tuned for harder tasks and multi-objective design problems.
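The "gradual shift" from confidence to reward guidance can be pictured with a small sketch. Everything here is an assumption made for illustration: the linear ramp, the convex blend of scores, and the names `blend_weight`, `choose_positions`, `warmup_steps`, and `ramp_steps` do not come from the paper, which may schedule the shift quite differently.

```python
def blend_weight(step, warmup_steps, ramp_steps):
    """Weight on the reward term: 0 during warmup, then a linear ramp up to 1 (assumed schedule)."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)


def choose_positions(confidence, reward_estimate, masked, step, k,
                     warmup_steps=100, ramp_steps=400):
    """Pick k masked positions to reveal, ranked by a blended score (hypothetical rule)."""
    lam = blend_weight(step, warmup_steps, ramp_steps)
    ranked = sorted(
        masked,
        key=lambda i: (1.0 - lam) * confidence[i] + lam * reward_estimate[i],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical per-position scores for four masked tokens.
confidence = [0.9, 0.2, 0.6, 0.4]
reward_estimate = [0.1, 0.8, 0.3, 0.7]
masked = [0, 1, 2, 3]

early = choose_positions(confidence, reward_estimate, masked, step=0, k=2)
late = choose_positions(confidence, reward_estimate, masked, step=10_000, k=2)
```

Early in the run the ranking follows confidence alone, and late in the run it follows the reward estimate alone. The host model's forward pass is never touched, which is the sense in which ordering is a separable lever.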

Core claim

DPRM keeps the host architecture, denoising objective, and supervision unchanged and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h-transform process-reward-guided ordering through online estimates. The exact DPRM policy is characterized as a reward-tilted Gibbs reveal law, with O(1/N) convergence of the stagewise Soft-BoN approximation and online bucketized controller tracking at empirical-Bernstein rates. Under tractable optimization assumptions, this yields a sample-complexity advantage over random and confidence-only ordering, with observed gains over baselines in pretraining, post-training, test-time scaling and mask
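To make the Soft-BoN approximation concrete, here is a minimal sketch, assuming a finite set of candidate reveal actions, a host proposal distribution, and a scalar reward estimate per action. The names `tilted_law`, `soft_bon_sample`, and `beta`, and all numeric values, are illustrative assumptions rather than the paper's code. The sampler draws N candidates from the proposal and resamples one of them with weight exp(beta * reward), while the target it approximates is proportional to proposal times exp(beta * reward).

```python
import math
import random


def tilted_law(proposal_probs, rewards, beta):
    """Exact reward-tilted Gibbs law: p(a) * exp(beta * r(a)), normalized."""
    weights = [p * math.exp(beta * r) for p, r in zip(proposal_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_bon_sample(proposal_probs, rewards, beta, n, rng):
    """Draw n candidates from the proposal, then pick one with exp(beta * r) weights."""
    actions = list(range(len(proposal_probs)))
    shortlist = rng.choices(actions, weights=proposal_probs, k=n)
    soft_weights = [math.exp(beta * rewards[a]) for a in shortlist]
    return rng.choices(shortlist, weights=soft_weights, k=1)[0]


rng = random.Random(0)
proposal = [0.5, 0.3, 0.2]  # hypothetical host proposal over three reveal actions
rewards = [0.0, 1.0, 2.0]   # hypothetical process-reward estimates
target = tilted_law(proposal, rewards, beta=1.0)
picked = soft_bon_sample(proposal, rewards, beta=1.0, n=8, rng=rng)
```

With `n=1` the sampler reduces to the host proposal, and larger shortlists move its output distribution toward `target`. That is the intuition behind a convergence rate stated in terms of N.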

What carries the argument

The DPRM policy, defined as the reward-tilted Gibbs reveal law induced by the Doob h-transform of the process reward, which gradually replaces confidence-driven ordering via online estimates.
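For readers unfamiliar with the construction, the generic Doob h-transform of a Markov kernel can be written as below. The notation is an assumption for illustration (base reveal kernel $P$, terminal reward $R$, inverse temperature $\beta$, horizon $T$) and may not match the paper's symbols.

```latex
% Generic Doob h-transform with h taken as the conditional expectation
% of an exponentiated terminal reward (assumed notation).
h_t(s) = \mathbb{E}\!\left[ e^{\beta R(S_T)} \,\middle|\, S_t = s \right],
\qquad
P^{h}_{t}(s' \mid s) = P(s' \mid s)\, \frac{h_{t+1}(s')}{h_t(s)} .
```

Because $h_t(s) = \sum_{s'} P(s' \mid s)\, h_{t+1}(s')$ by the tower property, the transformed kernel sums to one. Each candidate reveal is reweighted by the ratio $h_{t+1}(s')/h_t(s)$, which is the sense in which the resulting law is a reward-tilted version of the base proposal.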

Load-bearing premise

The online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates and tractable optimization assumptions hold to deliver sample-complexity advantage over random and confidence-only ordering.
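A rough picture of what "bucketized tracking at empirical-Bernstein rates" could look like is sketched below. It assumes rewards lie in [0, 1], buckets are equal-width confidence bins, and the confidence radius follows one commonly cited form of the empirical-Bernstein bound (due to Maurer and Pontil). The class `BucketTracker`, the function `bucket_index`, and the bin count are hypothetical; the paper's controller and constants may differ.

```python
import math


class BucketTracker:
    """Running mean and variance of observed rewards for one bucket (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, reward):
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    def radius(self, delta=0.05):
        """One-sided empirical-Bernstein radius, assuming rewards in [0, 1]."""
        if self.n < 2:
            return float("inf")
        variance = self.m2 / (self.n - 1)  # unbiased sample variance
        log_term = math.log(2.0 / delta)
        return (math.sqrt(2.0 * variance * log_term / self.n)
                + 7.0 * log_term / (3.0 * (self.n - 1)))


def bucket_index(confidence, num_buckets=10):
    """Map a confidence in [0, 1] to one of num_buckets equal-width bins."""
    return min(int(confidence * num_buckets), num_buckets - 1)


# Hypothetical (confidence, reward) observations, not data from the paper.
trackers = [BucketTracker() for _ in range(10)]
for confidence, reward in [(0.93, 1.0), (0.91, 1.0), (0.95, 0.0), (0.12, 0.0), (0.15, 1.0)]:
    trackers[bucket_index(confidence)].update(reward)
```

The radius shrinks like the square root of variance over n plus a 1/n term, so low-variance buckets tighten faster than a Hoeffding-style bound would allow. Whether the real controller achieves this in the diffusion-LM regime is exactly what the premise asks the reader to accept.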

What would settle it

A controlled experiment on a hard reasoning benchmark that finds no accuracy improvement when the full DPRM policy is used versus a confidence-only baseline would falsify the central performance claims.
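One way such a controlled comparison could be scored is a paired bootstrap over benchmark problems, in the spirit of the paired-bootstrap deltas mentioned for Figure 3. The sketch below is an assumption about how one might do this, not the paper's evaluation script. The function name, the resample count, and the toy outcome vectors are all hypothetical.

```python
import random


def paired_bootstrap_delta(correct_a, correct_b, num_resamples=2000, seed=0):
    """95% percentile interval for mean(correct_a) - mean(correct_b) over paired problems."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    deltas = []
    for _ in range(num_resamples):
        # Resample problems with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    deltas.sort()
    low = deltas[int(0.025 * num_resamples)]
    high = deltas[int(0.975 * num_resamples) - 1]
    return low, high


# Hypothetical per-problem outcomes (1 = solved). These are NOT results from the paper.
dprm = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
conf_only = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
low, high = paired_bootstrap_delta(dprm, conf_only)
```

If the interval for the accuracy difference comfortably contains zero on a hard benchmark with an adequate number of problems, that would count against the central performance claim. An interval that excludes zero would support it.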

Figures

Figures reproduced from arXiv: 2604.24357 by Andi Han, Atsushi Nitanda, Dake Bu, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Wei Huang.

**Figure 1.** DPRM as a plug-in token-ordering module. The host provides a local proposal over candidate actions, and DPRM replaces only the ordering rule. In practice, DPRM starts from confidence-based warmup and then uses shortlist-based Soft-BoN reweighting together with an online reward estimate to approximate the exact tilted reveal law. should be read as the current partially observed token array together with its…

**Figure 2.** PUMA vs. DPRM-PUMA on GSM8K at the shared 1.53M EMA checkpoint. We use the two official PUMA validation settings, unmasking num ∈ {2, 3}. DPRM-PUMA improves both

**Figure 3.** Bootstrap confidence intervals for PUMA and DPRM-PUMA at the shared 1.53M checkpoint. The two official unmasking settings both favor DPRM-PUMA, and the paired-bootstrap deltas in Tab. 6 exclude zero at the 95% level. Benchmarks and metrics. We evaluate completed post-training runs on GSM8K, MATH, and Countdown using pass@K curves for the six tested values K ∈ {1, 2, 4, 8, 16, 32}. For compact comparison in…

**Figure 4.** GSM8K pass@K curves by difficulty level (0: trivial, 1: easy, 2: medium, 3: hard). DMPO-DPRM’s advantage over Progressive DMPO is most visible on harder levels and at larger K. This subsection provides the full experimental details for the DPRM-DMPO results reported in Sec. 4.2.

**Figure 5.** MATH pass@K curves by difficulty level (1: trivial, 2: easy, 3: medium, 4: hard, 5: OOD). DPRM-DMPO provides consistent gains on hard and OOD subsets. Inference configuration. For DMPO-DPRM decoding we use its aligned inference rule: fast dllm with pd cache prefix, remasking=dprm soft bon, block length 32, temperature 0.2, and the checkpoint-local dprm estimator.json. In the reported evaluation scripts, th…

**Figure 6.** Countdown pass@K curves by number of target operands (2–6). Vanilla DMPO collapses on this task, falling below the base model at every difficulty level. DPRM-DMPO achieves the strongest performance across all levels. Interpretation. Taken together, the experiments support a two-step interpretation. Random masking is a poor state sampler for post-training because it allocates denoising capacity to states th…

**Figure 7.** Per-rank accuracy comparison on GSM8K. Rank 1 is the highest-scored survivor after pruning. DPRM-Prism improves at every rank position. differ only in the token-ordering policy used during trajectory pruning and remasking: 1. Prism (confidence): the original Prism baseline, which uses confidence top-k to rank and select tokens at each unmasking step; 2. DPRM-Prism: our method, which replaces confidence top…

**Figure 8.** Left: NFE–accuracy trade-off. The diamond markers show reference baselines from the Prism paper. Right: per-sample NFE distributions. NFE overhead. The ×1.76 NFE increase of DPRM-Prism over the baseline originates entirely from the DPRM ordering layer: at each unmasking step, Soft-BoN evaluates multiple candidate token orderings before selecting one. The Prism search scaffold (HTS branching, SVF calls, pru…

**Figure 9.** Forward-folding comparison on CAMEO2022 with 95% bootstrap intervals over targets. In the experiments, all three ordering-aware variants are statistically indistinguishable on the forward-folding metrics and all improve over DPLM-2 Bit. Interpretation. The DPLM results are useful because they separate several notions of protein quality: local structural agreement, global fold similarity, model confidence, …

**Figure 10.** Unconditional co-generation self-consistency over lengths 100–500 with 95% bootstrap intervals over generated samples. Progressive DPLM-2 Bit is strongest on bb-TM, pLDDT, and designable rate, while DPRM-DPLM-2 Bit incurs a smaller ca-RMSD penalty than the other ordering-aware variants. DPRM-DPLM offers a milder trade-off with comparable bb-TM and designable rate but a substantially smaller ca-RMSD penalt…

**Figure 11.** Dentate Gyrus DCM ordering evaluation with 95% bootstrap intervals over 293 validation cells. All ordering-aware variants improve token recovery, MAE, and zero-expression accuracy over random ordered masking. DPRM(random)-DCM is strongest on MAE and zero-expression accuracy in this compact setting. Interpretation. The DCM result should be read as evidence for the plug-in nature of DPRM rather than as a cl…

**Figure 12.** GenMol V2 de novo molecular generation with 95% bootstrap intervals over 1,000 generated molecules per method. GenMol V2 remains strongest on quality and uniqueness; DPRM(random)-GenMol has the highest validity; Progressive-GenMol has the highest diversity

**Figure 13.** GenMol V2 fragment-constrained generation on the common stable seven-fragment subset. Error bars show 95% bootstrap intervals over fragment-task units. DPRM(random)-GenMol improves linker and linker-onestep validity, while Progressive and DPRM-confidence improve motif-extension and scaffold-decoration quality. Interpretation. The GenMol pilot is intentionally conservative. It supports the claim that token…

**Figure 14.** SDPO ordering comparison with 95% bootstrap intervals over 640 generated DNA samples per method. Confidence-progressive ordering improves HepG2 and log-likelihood but collapses ATAC and k-mer quality. DPRM variants preserve substantially more ATAC and k-mer quality while still improving HepG2 over the SDPO baseline. Interpretation. The SDPO experiment reinforces the multi-objective nature of scientific di…

**Figure 15.** Countdown training reward versus global step from W&B logs. Bands use the logged reward standard deviation. The random DMPO curve stitches the initial run and its resume run; Progressive DMPO and DMPO-DPRM use their completed matched runs. This plot is an optimization diagnostic rather than an evaluation metric. Token-level evidence for the early proxy assumption

**Figure 16.** Instrumented Countdown diagnostics for the finite-sample ordering theory. Panel (a) shows that confidence bins are a strong proxy for local CE loss in early training. Panel (b) shows that confidence-aligned controllers select lower-CE tokens than random ordering early. Panel (c) shows the beta sensitivity of late low-confidence selected-token mass. Intervals in Panel (c) are bootstrap intervals over logge…

**Figure 17.** Late-stage coverage inside the low-confidence region on instrumented Countdown reruns. DPRM increases selected-token mass in bins 0–5 relative to confidence-only Progressive DMPO while assigning positive DPRM scores to selected tokens as β grows. This is the bin-level diagnostic corresponding to the theorem’s late under-coverage assumption.

**Figure 18.** Outcome-level diagnostics for the finite-sample ordering theory on Countdown. Panel (a) supports the early-stage assumption that confidence is a useful local optimization proxy: confidence-aligned progressive training dominates random DMPO at pass@1, especially on easier operand-count subsets. Panels (b)–(c) support the late-stage confidence-undercoverage assumption: DPRM’s gain over confidence-only Progr…
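Several of the figures above report pass@K curves. As a reading aid, the sketch below shows one widely used unbiased estimator of pass@K from n samples per problem, of which c are correct. It is an assumption that the paper computes its curves this way; the page does not say, and the counts in the example are hypothetical.

```python
from math import comb


def pass_at_k(num_samples, num_correct, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if k > num_samples:
        raise ValueError("k must not exceed the number of samples")
    if num_samples - num_correct < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# Hypothetical problem: 32 samples, 4 of them correct (not a number from the paper).
curve = {k: pass_at_k(32, 4, k) for k in (1, 2, 4, 8, 16, 32)}
```

A benchmark-level curve would average these per-problem values, optionally within each difficulty level as in the figures.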

Original abstract

Diffusion language models generate without a fixed left-to-right order, leaving token ordering as a central algorithmic choice. Existing systems mainly use random masking or confidence-driven ordering, which respectively suffer from train-test mismatch and myopic exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module that keeps the host architecture, denoising objective and supervision unchanged, and modifies only the ordering policy. DPRM starts from confidence-driven ordering and gradually shifts to process-reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove convergence of its stagewise Soft-BoN approximation, show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates, and establish a sample-complexity advantage under tractable optimization assumptions. Across nine hosts covering language reasoning, test-time scaling, protein, single-cell, molecular, DNA, text-to-image generation, and VQA, DPRM order variants improve several language, DNA, and multimodal settings while also identifying boundary cases where confidence-only ordering or task-specific utilities are preferable. Code is available at: https://github.com/DakeBU/DPRM-DLLM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DPRM adds a Doob h-transform plug-in for token ordering in diffusion LMs that brings selective gains on reasoning tasks and a clean characterization of the policy, but the sample-complexity advantage depends on assumptions the experiments do not directly check.

This paper's main move is to swap in a Doob h-transform process reward module for deciding which tokens to reveal or revise in diffusion language models. It leaves the host network, denoising objective, and training data untouched, which makes the change easy to test on top of existing systems. The reported improvements on harder reasoning subsets in pretraining and test-time scaling are the clearest practical signal here.

Referee Report

2 major / 2 minor

Summary. The paper introduces DPRM, a plug-in Doob h-transform-induced token-ordering module for diffusion language models. It claims to keep the host architecture, denoising objective, and supervision unchanged while shifting the ordering policy from confidence-driven progressive ordering to Doob h-transform process-reward-guided ordering through online estimates. The manuscript characterizes the exact DPRM policy as a reward-tilted Gibbs reveal law, proves O(1/N) convergence of the stagewise Soft-BoN approximation, shows that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates, and claims a sample-complexity advantage over random and confidence-only ordering under tractable optimization assumptions. Empirically, DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling, and single-cell masked diffusion, with strong gains on harder reasoning subsets. In protein, molecular generation, and DNA design the gains are multi-objective: selected structural or fragment-constrained metrics improve, but no variant uniformly dominates every quality metric.

Significance. If the theoretical results hold, DPRM provides a general-purpose, architecture-preserving module that treats token ordering as a controllable axis in diffusion LMs, potentially improving sample efficiency and performance on reasoning and generative tasks. The plug-in design and public code release are strengths that facilitate adoption and reproducibility. The multi-objective empirical profile in biological domains suggests practical utility but also indicates that benefits are metric-dependent rather than blanket improvements.

major comments (2)

[Abstract] The claim that DPRM 'yields a sample-complexity advantage over random and confidence-only ordering' under 'tractable optimization assumptions' is load-bearing for the theoretical contribution, yet the assumptions (convexity, Lipschitz constants on the reward tilt, bounded variance of the Doob h-transform estimator, etc.) are never stated. Without them it is impossible to verify whether the stated O(1/N) convergence of Soft-BoN and empirical-Bernstein tracking of the bucketized controller actually produce the promised rate advantage in the diffusion-LM regime; the empirical sections report only final performance metrics, not direct measurements of sample complexity or tracking error versus N.
[Abstract] The characterizations of the DPRM policy as a reward-tilted Gibbs reveal law, the O(1/N) convergence proof, and the empirical-Bernstein tracking claim are presented as central results, but the abstract supplies no key steps, conditions, or equation references. Given that the full derivations and experimental protocols are required to assess soundness, these claims must be expanded with explicit statements of all assumptions and direct empirical validation of the rates before the theoretical superiority can be accepted.

minor comments (2)

Notation for 'Doob h-transform' is inconsistent between the title ('Doob h transform-induced') and the abstract ('Doob h-transform'); standardize throughout.
The multi-objective nature of results in protein/molecular/DNA tasks is noted, but uniform reporting of all quality metrics (not only the improved subset) in a single table would clarify trade-offs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We agree that the abstract would benefit from greater explicitness regarding assumptions, characterizations, and empirical validation of rates. We address each major comment below and will revise the manuscript to incorporate these improvements.

Referee: [Abstract] The claim that DPRM 'yields a sample-complexity advantage over random and confidence-only ordering' under 'tractable optimization assumptions' is load-bearing for the theoretical contribution, yet the assumptions (convexity, Lipschitz constants on the reward tilt, bounded variance of the Doob h-transform estimator, etc.) are never stated. Without them it is impossible to verify whether the stated O(1/N) convergence of Soft-BoN and empirical-Bernstein tracking of the bucketized controller actually produce the promised rate advantage in the diffusion-LM regime; the empirical sections report only final performance metrics, not direct measurements of sample complexity or tracking error versus N.

Authors: We agree that the abstract does not enumerate the assumptions, even though they are formally stated in Section 3. In the revised version we will explicitly list the key assumptions (convexity of the optimization landscape, Lipschitz continuity of the reward tilt, and bounded variance of the Doob h-transform estimator) directly in the abstract. We will also add a supplementary analysis with plots of tracking error and effective sample complexity versus N to provide direct empirical validation of the O(1/N) and empirical-Bernstein rates. revision: yes
Referee: [Abstract] The characterizations of the DPRM policy as a reward-tilted Gibbs reveal law, the O(1/N) convergence proof, and the empirical-Bernstein tracking claim are presented as central results, but the abstract supplies no key steps, conditions, or equation references. Given that the full derivations and experimental protocols are required to assess soundness, these claims must be expanded with explicit statements of all assumptions and direct empirical validation of the rates before the theoretical superiority can be accepted.

Authors: We acknowledge that the abstract is concise and omits explicit references to theorems or equation numbers. We will revise the abstract to include brief statements of the main results (reward-tilted Gibbs characterization, O(1/N) convergence of the stagewise Soft-BoN approximation, and empirical-Bernstein tracking) together with pointers to the relevant theorems and sections. The full derivations, conditions, and experimental protocols already appear in the main text and appendix; the revision will make these connections clearer to readers. revision: yes
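A direct measurement of approximation error versus N, of the kind requested in the first major comment, can be prototyped on a toy problem. The sketch below computes the exact output distribution of a Soft-BoN sampler over three actions by enumerating shortlist counts and reports its total-variation distance to the tilted target. It is an illustrative toy under assumed names and values (`soft_bon_law`, `beta`, the three-action proposal), not the authors' supplementary analysis, and it does not by itself verify the paper's O(1/N) theorem.

```python
import math


def tilted_law(proposal, rewards, beta):
    """Exact reward-tilted Gibbs law: p(a) * exp(beta * r(a)), normalized."""
    weights = [p * math.exp(beta * r) for p, r in zip(proposal, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def soft_bon_law(proposal, rewards, beta, n):
    """Exact output law of Soft-BoN with n proposal draws, via shortlist count enumeration."""
    tilt = [math.exp(beta * r) for r in rewards]
    law = [0.0] * len(proposal)
    for counts in compositions(n, len(proposal)):
        # Multinomial probability of observing these counts in the shortlist.
        prob = float(math.factorial(n))
        for c, p in zip(counts, proposal):
            prob *= p ** c / math.factorial(c)
        norm = sum(c * t for c, t in zip(counts, tilt))
        for a, (c, t) in enumerate(zip(counts, tilt)):
            law[a] += prob * c * t / norm
    return law


def total_variation(p, q):
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


proposal = [0.5, 0.3, 0.2]  # hypothetical host proposal
rewards = [0.0, 1.0, 2.0]   # hypothetical reward estimates
target = tilted_law(proposal, rewards, beta=1.0)
errors = {n: total_variation(soft_bon_law(proposal, rewards, 1.0, n), target)
          for n in (1, 4, 16, 64)}
```

Plotting `errors` against N on log-log axes gives a quick visual check of the decay rate on the toy problem. The same harness could be pointed at logged proposals and reward estimates from a real host.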

Circularity Check

0 steps flagged

No significant circularity; derivations rely on external mathematical tools and standard statistical bounds.

The paper characterizes the DPRM policy as a reward-tilted Gibbs reveal law, proves O(1/N) convergence of the Soft-BoN approximation, and invokes empirical-Bernstein rates for the bucketized controller. These steps are presented as independent mathematical results rather than reductions to fitted parameters or self-referential definitions. The sample-complexity advantage is explicitly conditioned on unspecified 'tractable optimization assumptions' without claiming it follows by construction from the inputs. No self-citations appear load-bearing in the abstract or description, and the Doob h-transform is treated as an external tool. The chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond reference to the standard Doob h-transform from prior stochastic-process literature. The DPRM module itself is an algorithmic construction rather than a new postulated physical entity.

pith-pipeline@v0.9.0 · 5630 in / 1411 out tokens · 39349 ms · 2026-05-08T04:09:31.083964+00:00 · methodology
