pith. machine review for the scientific record.

arxiv: 2602.06462 · v3 · submitted 2026-02-06 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Diffusion-State Policy Optimization for Masked Diffusion Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:14 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords masked diffusion · policy optimization · credit assignment · language models · reinforcement learning · diffusion models · math reasoning · planning tasks

The pith

DiSPO assigns credit to intermediate token-filling decisions in masked diffusion language models by branching from cached logits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Masked diffusion language models generate text by iteratively filling masked tokens, yet terminal rewards on final outputs give only coarse signals for the many intermediate choices that shape the result. DiSPO adds a plug-in layer that selects intermediate masked states, resamples the remaining masked positions from logits cached during rollout, scores the completed sequences, and applies policy-gradient updates solely to the newly filled tokens. The method reuses the exact same rollouts already computed for terminal-feedback baselines and needs no extra diffusion steps or optimizer iterations. Experiments on LLaDA-8B-Instruct show consistent gains over diffu-GRPO and SPG on math and planning benchmarks when rollout count and training steps are held fixed. The approach rests on a fixed-state objective that yields an unbiased estimator for the branched completions.
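To make the mechanism concrete, here is a minimal sketch, in Python, of what one DiSPO-style plug-in step could look like on a toy categorical policy. Everything here is an assumption for illustration: the names (policy_logits, toy_reward, MASK_ID), the toy model, and the group-mean baseline are not taken from the paper's implementation.

```python
# Minimal sketch of one DiSPO-style plug-in update on a toy categorical policy.
# All names, sizes, and the reward are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

VOCAB, SEQ_LEN, Z = 16, 8, 4            # toy sizes; Z = number of branched fillings
MASK_ID = 0                              # hypothetical mask-token id

# Toy "policy": per-position logits produced from a learned embedding of the state.
emb = torch.nn.Embedding(VOCAB, VOCAB)   # parameters we will update

def policy_logits(state: torch.Tensor) -> torch.Tensor:
    """Return (seq_len, vocab) logits for every position of a partially masked state."""
    return emb(state)

def toy_reward(seq: torch.Tensor) -> float:
    """Stand-in terminal reward (placeholder for the task's verifier/reward)."""
    return (seq % 2 == 0).float().mean().item()

# 1) An intermediate masked state, as would be selected from a cached rollout.
state = torch.randint(1, VOCAB, (SEQ_LEN,))
masked = torch.zeros(SEQ_LEN, dtype=torch.bool)
masked[3:] = True                        # positions still masked at this denoising step
state[masked] = MASK_ID

# 2) Branch: resample the masked positions Z times from the logits at this state
#    (in DiSPO these logits would come from the rollout cache, not a fresh pass).
logits = policy_logits(state)
probs = F.softmax(logits[masked], dim=-1)
branches = torch.distributions.Categorical(probs).sample((Z,))    # (Z, n_masked)

# 3) Score each completed sequence with the same terminal reward.
rewards = []
for z in range(Z):
    completed = state.clone()
    completed[masked] = branches[z]
    rewards.append(toy_reward(completed))
rewards = torch.tensor(rewards)
advantages = rewards - rewards.mean()    # simple group baseline over the Z branches

# 4) Policy-gradient update restricted to the newly filled tokens only.
log_probs = F.log_softmax(policy_logits(state)[masked], dim=-1)   # (n_masked, vocab)
per_branch_logp = log_probs.gather(1, branches.T).sum(dim=0)      # (Z,)
loss = -(advantages * per_branch_logp).mean()
loss.backward()                          # gradients flow only through the filled positions
print("step loss:", loss.item())
```

A real integration would branch at states and logits cached during the diffusion rollouts already used by the terminal-feedback objective; the toy above only illustrates the branch-score-update pattern, not that reuse.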

Core claim

At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps.

What carries the argument

DiSPO credit-assignment layer that branches completions at intermediate masked states from cached logits to supply targeted policy-gradient updates to newly filled tokens.

Load-bearing premise

Resampling completions from rollout-cached logits at selected intermediate masked states yields unbiased or low-variance credit signals for the newly filled tokens without systematic bias from the branching process.
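The abstract says a fixed-state objective is formalized but gives no equations; as a hedged reading, the conditional objective and its score-function gradient likely take a form along these lines (all notation below is assumed, not quoted from the paper):

```latex
% Hedged sketch of a fixed-state objective at an intermediate masked state s,
% where z is a filling of the currently masked positions and R is the terminal
% reward on the completed sequence (notation assumed, not the paper's).
\[
\begin{aligned}
  J(s;\theta) &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid s)}
    \bigl[ R\bigl(\mathrm{complete}(s, z)\bigr) \bigr], \\
  \nabla_\theta J(s;\theta) &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid s)}
    \bigl[ \bigl( R(\mathrm{complete}(s, z)) - b(s) \bigr)\,
           \nabla_\theta \log \pi_\theta(z \mid s) \bigr],
\end{aligned}
\]
```

with b(s) any state-dependent baseline, for instance the mean reward over the Z branches sampled at s; restricting the log-probability term to the newly filled tokens is what confines the update to those positions.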

What would settle it

Running the same LLaDA-8B-Instruct math and planning benchmarks with matched rollout count and optimizer steps, and finding that DiSPO yields no score improvement, or a decrease, relative to the terminal-feedback baselines.

Figures

Figures reproduced from arXiv: 2602.06462 by Daisuke Oba, Hiroki Furuta, Naoaki Okazaki.

Figure 1
Figure 1. Conceptual overview. Top: terminal-feedback GRPO treats the denoising trajectory as one decision. Bottom: DiSPO is a plug-in step that branches at intermediate states (resampling Z fillings from cached logits), scores them with the same reward, and backpropagates gradients only through the filled tokens. view at source ↗
Figure 2
Figure 2. Reward curves. Terminal reward curves (top) and step reward curves (bottom) on LLaDA-8B-Instruct during policy optimization. Across tasks, DiSPO reaches higher terminal rewards earlier and maintains them over training. Step rewards have smaller magnitudes but follow the same trends as the terminal rewards, indicating their role as a complementary training signal. view at source ↗
Figure 3
Figure 3. Variance reduction of the step-wise gradient estimator on Sudoku. Left: updating only action tokens (vs. all tokens) reduces variance at Z = 2 (Prop. 4.3). Right: increasing Z from Z = 2 reduces variance with action-only updates (Prop. 4.4). Error bars show paired 95% bootstrap CIs. view at source ↗
Figure 4
Figure 4. Comparison of the same instance at the same denoising step: diffu-GRPO already violates constraints due to an early incorrect fill, whereas DiSPO maintains a consistent partial assignment. view at source ↗
Figure 6
Figure 6. Wall-clock-matched training curves on LLaDA-8B-Instruct for Sudoku: accuracy (N_gen = 128) and reward vs. training time. DiSPO surpasses diffu-GRPO within the budget. view at source ↗
read the original abstract

Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment method for masked diffusion language models. It formalizes a fixed-state objective over branched completions at selected intermediate masked states, derives a policy-gradient estimator that reuses terminal rollouts by resampling only newly unmasked tokens from cached logits, and reports consistent empirical gains over terminal-feedback baselines (diffu-GRPO, SPG) on math and planning benchmarks with LLaDA-8B-Instruct under matched rollout compute and optimizer steps.

Significance. If the estimator is unbiased, DiSPO would provide a practical, low-overhead way to improve intermediate credit assignment in iterative masked generation without extra multi-step rollouts. The reuse of existing terminal rollouts is a clear efficiency strength that could generalize to other diffusion-based RL setups for reasoning tasks.

major comments (1)
  1. [Method] The derivation of the policy-gradient estimator for the fixed-state objective (abstract and method description) treats resampled branches from rollout-cached logits at intermediate masked states as unbiased samples from the conditional policy. No importance-sampling correction or baseline adjustment for the distribution shift induced by the original unbranched rollout trajectory is described; this is load-bearing because any resulting bias in credit signals for the newly filled tokens would undermine the claimed improvements over diffu-GRPO and SPG.
minor comments (1)
  1. [Abstract] The abstract states that a fixed-state objective is formalized and a policy-gradient estimator derived but supplies no equations or proof sketch; adding these (even in an appendix) would allow direct verification of the estimator's properties.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful review and for identifying a key point in the derivation of our policy-gradient estimator. We address the major comment below with a clarification of the fixed-state objective and the unbiasedness of the estimator.

read point-by-point responses
  1. Referee: [Method] The derivation of the policy-gradient estimator for the fixed-state objective (abstract and method description) treats resampled branches from rollout-cached logits at intermediate masked states as unbiased samples from the conditional policy. No importance-sampling correction or baseline adjustment for the distribution shift induced by the original unbranched rollout trajectory is described; this is load-bearing because any resulting bias in credit signals for the newly filled tokens would undermine the claimed improvements over diffu-GRPO and SPG.

    Authors: We thank the referee for this observation. The fixed-state objective is explicitly conditional on a selected intermediate masked state s: it is the expected terminal reward over completions generated from s onward. The policy-gradient estimator is derived for this conditional objective J(s). At any such fixed s, the branched completions are obtained by resampling the remaining masked tokens directly from the current policy's logits at s (i.e., from π(·|s)). Because these action samples are drawn on-policy from the conditional policy at the fixed state, the resulting estimator is unbiased for ∇_θ J(s) without requiring importance-sampling corrections that would account for the probability of reaching s under the original rollout. The original trajectory serves only to identify and cache the intermediate states and their logits; it does not alter the sampling distribution of the actions taken from those states. We will add a short clarifying paragraph in Section 3.2 to make this conditional unbiasedness explicit and to contrast it with the unconditional terminal-feedback estimators used by diffu-GRPO and SPG.

    revision: partial
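To pin down the point of disagreement: if the Z fillings at a fixed state s were drawn from some behavior distribution q(·|s) other than the current policy, an importance weight would be needed; the rebuttal's position is that q(·|s) = π_θ(·|s) because the branches are resampled from the policy's own logits at s. A hedged sketch of the two forms (symbols assumed, not quoted from the paper):

```latex
% Off-policy branches would require an importance-sampling correction:
\[
\nabla_\theta J(s) \;=\; \mathbb{E}_{z \sim q(\cdot \mid s)}
  \!\left[ \frac{\pi_\theta(z \mid s)}{q(z \mid s)}\,
           R(\mathrm{complete}(s, z))\,
           \nabla_\theta \log \pi_\theta(z \mid s) \right].
\]
% The rebuttal asserts q(\cdot \mid s) = \pi_\theta(\cdot \mid s), i.e. the ratio is
% identically 1, so the plain score-function estimator is conditionally unbiased
% for \nabla_\theta J(s).
```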

Circularity Check

0 steps flagged

No significant circularity; estimator derived independently from fixed-state objective

full rationale

The paper formalizes a fixed-state objective for branched completions at intermediate masked states and derives a policy-gradient estimator that reuses the same terminal rollouts required by baseline methods such as diffu-GRPO and SPG. This derivation does not reduce by construction to a fitted parameter, self-definition, or unverified self-citation chain. The central empirical claim of consistent improvement under matched rollout compute and optimizer steps is presented as an external benchmark result rather than a tautological consequence of the estimator's internal form. No load-bearing uniqueness theorems, smuggled ansatzes, or renamings of known results are invoked in the derivation chain. The construction is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities. The method reuses existing cached logits and terminal rollouts; no new conserved quantities or particles are introduced.

pith-pipeline@v0.9.0 · 5482 in / 1105 out tokens · 48459 ms · 2026-05-16T07:14:48.723580+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 9 internal anchors

  1. [1]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  2. [2]

    DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

    Gong, S., Zhang, R., Zheng, H., Gu, J., Jaitly, N., Kong, L., and Zhang, Y. DiffuCoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639, 2025.

  3. [3]

    MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

    He, H., Renz, K., Cao, Y., and Geiger, A. MDPO: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148.

  4. [4]

    Let's Verify Step by Step

    Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. arXiv preprint arXiv:2305.20050.

  5. [5]

    s1: Simple test-time scaling

    URL https://arxiv.org/abs/2501.19393. Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models.

  6. [6]

    Large Language Diffusion Models

    URL https://arxiv.org/abs/2502.09992. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

  7. [7]

    Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

    Rojas, K., Lin, J., Rasul, K., Schneider, A., Nevmyvaka, Y., Tao, M., and Deng, W. Improving reasoning for diffusion language models via group diffusion policy optimization. arXiv preprint arXiv:2510.08554.

  8. [8]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  9. [9]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  10. [10]

    wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

    Tang, X., Dolga, R., Yoon, S., and Bogunovic, I. wd1: Weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838.

  11. [11]

    SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

    Wang, C., Rashidinejad, P., Su, D., Jiang, S., Wang, S., Zhao, S., Zhou, C., Shen, S. Z., Chen, F., Jaakkola, T., et al. SPG: Sandwiched policy gradient for masked diffusion language models. arXiv preprint arXiv:2510.09541, 2025a. Wang, G., Schiff, Y., Turok, G., and Kuleshov, V. ...

  12. [12]

    Advancing Reasoning in Diffusion Language Models with Denoising Process Rewards

    Xie, S., Kong, L., Song, X., Dong, X., Chen, G., Xing, E. P., and Zhang, K. Step-aware policy optimization for reasoning in diffusion large language models. arXiv preprint arXiv:2510.01544.

  13. [13]

    Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

    Yang, J., Chen, G., Hu, X., and Shao, J. Taming masked diffusion language models via consistency trajectory reinforcement learning with fewer decoding step. arXiv preprint arXiv:2509.23924.

  14. [14]

    Dream 7B: Diffusion Large Language Models

    Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.

  15. [15]

    Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods

    Zekri, O. and Boullé, N. Fine-tuning discrete diffusion models with policy gradient methods. arXiv preprint arXiv:2502.01384.

  16. [16]

    DiffPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning

    Zhao, H., Liang, D., Tang, W., Yao, D., and Kallus, N. DiffPO: Training diffusion LLMs to reason fast and furious via reinforcement learning. arXiv preprint arXiv:2510.02212, 2025a. Zhao, S., Gupta, D., Zheng, Q., and Grover, A. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025b. Zie...

  17. [17]

    Condition on a particular timestep t being selected

    on the corresponding intermediate state(s). Condition on a particular timestep t being selected. Under the assumptions of Theorem 4.1, we have E[−∇_θ L_step(θ) | t] = (c/Z) ∇_θ J_t(θ), where the expectation is over q ∼ D, s_t ∼ d_t(q), and the branched action samples at that state. Taking the expectation over t ∼ ω(t) yields E[−∇_θ L_step(θ)] = (c/Z) Σ_t ω(t) ∇_θ J_t(θ). (19) Terminal...

  18. [18]

    We use the training data publicly available at https://github.com/Black-Phoenix/4x4-Sudoku-Dataset

    is a subset of MATH focusing on competition-level problems. Reward is computed along two axes, i.e., format reward (max 1.0) and correctness reward (max 2.0). Sudoku: the 4×4 Sudoku task is a synthetic benchmark for planning. We use the training data publicly available at https://github.com/Black-Phoenix/4x4-Sudoku-Dataset. As for the evaluation data, w...