Diffusion-State Policy Optimization for Masked Diffusion Language Models
Pith reviewed 2026-05-16 07:14 UTC · model grok-4.3
The pith
DiSPO assigns credit to intermediate token-filling decisions in masked diffusion language models by branching from cached logits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps.
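To make the mechanism concrete, here is a minimal sketch of the branching step as the claim describes it. This is our illustration, not the authors' code: the function name `dispo_branch`, the `score_completion` callable, and all tensor shapes are assumptions.

```python
import torch

def dispo_branch(cached_logits, state_tokens, mask, num_branches, score_completion):
    """Minimal sketch of DiSPO-style branching at one intermediate masked state.

    cached_logits:     (seq_len, vocab) logits cached during the original rollout.
    state_tokens:      (seq_len,) token ids; masked positions hold a placeholder id.
    mask:              (seq_len,) bool, True where a position is still masked.
    score_completion:  callable mapping a completed sequence to a scalar reward.
    """
    # Sampling distribution over the currently masked positions only.
    dist = torch.distributions.Categorical(logits=cached_logits[mask])
    branches, rewards, new_token_logps = [], [], []
    for _ in range(num_branches):
        # Fill all currently masked positions in one shot from the cached
        # logits: no additional multi-step diffusion rollout is run.
        new_tokens = dist.sample()                 # (num_masked,)
        completion = state_tokens.clone()
        completion[mask] = new_tokens
        branches.append(completion)
        rewards.append(score_completion(completion))
        # Only the newly filled tokens enter the policy-gradient update.
        # (In actual training their log-probs would be recomputed under the
        # live policy so gradients can flow; the cached logits here only
        # define the sampling distribution.)
        new_token_logps.append(dist.log_prob(new_tokens).sum())
    return branches, rewards, new_token_logps
```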
What carries the argument
DiSPO credit-assignment layer that branches completions at intermediate masked states from cached logits to supply targeted policy-gradient updates to newly filled tokens.
Load-bearing premise
Resampling completions from rollout-cached logits at selected intermediate masked states yields unbiased or low-variance credit signals for the newly filled tokens without systematic bias from the branching process.
What would settle it
Rerunning the same LLaDA-8B-Instruct math and planning benchmarks with matched rollout counts and optimizer steps, and finding that DiSPO yields no score improvement (or a decrease) relative to the terminal-feedback baselines.
Original abstract
Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment method for masked diffusion language models. It formalizes a fixed-state objective over branched completions at selected intermediate masked states, derives a policy-gradient estimator that reuses terminal rollouts by resampling only newly unmasked tokens from cached logits, and reports consistent empirical gains over terminal-feedback baselines (diffu-GRPO, SPG) on math and planning benchmarks with LLaDA-8B-Instruct under matched rollout compute and optimizer steps.
Significance. If the estimator is unbiased, DiSPO would provide a practical, low-overhead way to improve intermediate credit assignment in iterative masked generation without extra multi-step rollouts. The reuse of existing terminal rollouts is a clear efficiency strength that could generalize to other diffusion-based RL setups for reasoning tasks.
Major comments (1)
- [Method] The derivation of the policy-gradient estimator for the fixed-state objective (abstract and method description) treats resampled branches from rollout-cached logits at intermediate masked states as unbiased samples from the conditional policy. No importance-sampling correction or baseline adjustment for the distribution shift induced by the original unbranched rollout trajectory is described; this is load-bearing because any resulting bias in credit signals for the newly filled tokens would undermine the claimed improvements over diffu-GRPO and SPG.
Minor comments (1)
- [Abstract] The abstract states that a fixed-state objective is formalized and a policy-gradient estimator derived but supplies no equations or proof sketch; adding these (even in an appendix) would allow direct verification of the estimator's properties.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for identifying a key point in the derivation of our policy-gradient estimator. We address the major comment below with a clarification of the fixed-state objective and the unbiasedness of the estimator.
Point-by-point responses
-
Referee: [Method] The derivation of the policy-gradient estimator for the fixed-state objective (abstract and method description) treats resampled branches from rollout-cached logits at intermediate masked states as unbiased samples from the conditional policy. No importance-sampling correction or baseline adjustment for the distribution shift induced by the original unbranched rollout trajectory is described; this is load-bearing because any resulting bias in credit signals for the newly filled tokens would undermine the claimed improvements over diffu-GRPO and SPG.
Authors: We thank the referee for this observation. The fixed-state objective is explicitly conditional on a selected intermediate masked state s: it is the expected terminal reward over completions generated from s onward. The policy-gradient estimator is derived for this conditional objective J(s). At any such fixed s, the branched completions are obtained by resampling the remaining masked tokens directly from the current policy's logits at s (i.e., from π(·|s)). Because these action samples are drawn on-policy from the conditional policy at the fixed state, the resulting estimator is unbiased for ∇_θ J(s) without requiring importance-sampling corrections that would account for the probability of reaching s under the original rollout. The original trajectory serves only to identify and cache the intermediate states and their logits; it does not alter the sampling distribution of the actions taken from those states. We will add a short clarifying paragraph in Section 3.2 to make this conditional unbiasedness explicit and to contrast it with the unconditional terminal-feedback estimators used by diffu-GRPO and SPG.
Revision: partial
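A toy numeric check of this conditional-unbiasedness argument (ours, not from the paper): for a fixed state s and a toy categorical policy, the score-function estimator built from on-policy branches matches the exact gradient of J(s) with no importance-sampling correction for how s was reached.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)        # logits of a toy policy π_θ(·|s)
reward = torch.tensor([0.0, 1.0, 0.2, 0.5, 0.9])  # arbitrary terminal rewards R(a)

# Exact gradient of J(s) = Σ_a π_θ(a|s) R(a).
probs = torch.softmax(theta, dim=0)
exact = torch.autograd.grad((probs * reward).sum(), theta)[0]

# Score-function estimate from branched actions resampled on-policy at fixed s.
n = 200_000
dist = torch.distributions.Categorical(logits=theta)
actions = dist.sample((n,))                       # branched action samples at s
surrogate = (reward[actions] * dist.log_prob(actions)).mean()
estimate = torch.autograd.grad(surrogate, theta)[0]

print(torch.allclose(exact, estimate, atol=1e-2))  # True, up to Monte Carlo noise
```

Holding s fixed is what removes the need for a correction; a reweighting term would only arise if the estimate were also averaged over states drawn from a distribution other than the one appearing in the objective.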
Circularity Check
No significant circularity; estimator derived independently from fixed-state objective
Full rationale
The paper formalizes a fixed-state objective for branched completions at intermediate masked states and derives a policy-gradient estimator that reuses the same terminal rollouts required by baseline methods such as diffu-GRPO and SPG. This derivation does not reduce by construction to a fitted parameter, self-definition, or unverified self-citation chain. The central empirical claim of consistent improvement under matched rollout compute and optimizer steps is presented as an external benchmark result rather than a tautological consequence of the estimator's internal form. No load-bearing uniqueness theorems, smuggled ansatzes, or renamings of known results are invoked in the derivation chain. The construction is therefore self-contained.
Reference graph
Works this paper leans on
-
[1]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
-
[2]
Gong, S., Zhang, R., Zheng, H., Gu, J., Jaitly, N., Kong, L., and Zhang, Y. DiffuCoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639, 2025.
-
[3]
He, H., Renz, K., Cao, Y., and Geiger, A. MDPO: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148, 2025.
-
[4]
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
-
[5]
Large Language Diffusion Models
Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.
-
[6]
Training Language Models to Follow Instructions with Human Feedback
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
-
[7]
Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Rojas, K., Lin, J., Rasul, K., Schneider, A., Nevmyvaka, Y., Tao, M., and Deng, W. Improving reasoning for diffusion language models via group diffusion policy optimization. arXiv preprint arXiv:2510.08554, 2025.
-
[8]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-
[9]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
-
[10]
Tang, X., Dolga, R., Yoon, S., and Bogunovic, I. wd1: Weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838, 2025.
-
[11]
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Wang, C., Rashidinejad, P., Su, D., Jiang, S., Wang, S., Zhao, S., Zhou, C., Shen, S. Z., Chen, F., Jaakkola, T., et al. SPG: Sandwiched policy gradient for masked diffusion language models. arXiv preprint arXiv:2510.09541, 2025.
-
[12]
Advancing Reasoning in Diffusion Language Models with Denoising Process Rewards
Xie, S., Kong, L., Song, X., Dong, X., Chen, G., Xing, E. P., and Zhang, K. Step-aware policy optimization for reasoning in diffusion large language models. arXiv preprint arXiv:2510.01544, 2025.
-
[13]
Yang, J., Chen, G., Hu, X., and Shao, J. Taming masked diffusion language models via consistency trajectory reinforcement learning with fewer decoding step. arXiv preprint arXiv:2509.23924, 2025.
-
[14]
Dream 7B: Diffusion Large Language Models
Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.
-
[15]
Zekri, O. and Boullé, N. Fine-tuning discrete diffusion models with policy gradient methods. arXiv preprint arXiv:2502.01384, 2025.
-
[16]
Zhao, H., Liang, D., Tang, W., Yao, D., and Kallus, N. DiffPO: Training diffusion LLMs to reason fast and furious via reinforcement learning. arXiv preprint arXiv:2510.02212, 2025a.
Zhao, S., Gupta, D., Zheng, Q., and Grover, A. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025b.
-
[17]
Condition on a particular timestep t being selected
…on the corresponding intermediate state(s). Condition on a particular timestep t being selected. Under the assumptions of Theorem 4.1, we have E[−∇_θ L_step(θ) | t] = c_Z ∇_θ J_t(θ), where the expectation is over q ∼ D, s_t ∼ d_t(q), and the branched action samples at that state. Taking the expectation over t ∼ ω(t) yields E[−∇_θ L_step(θ)] = c_Z Σ_t ω(t) ∇_θ J_t(θ). (19) Terminal…
-
[18]
We use the training data publicly available at https://github.com/Black-Phoenix/4x4-Sudoku-Dataset
…is a subset of MATH focusing on competition-level problems. Reward is computed by considering two axes, i.e., format reward (max 1.0) and correctness reward (max 2.0). Sudoku: the 4×4 Sudoku task is a synthetic benchmark for planning. We use the training data publicly available at https://github.com/Black-Phoenix/4x4-Sudoku-Dataset. As for the evaluation data, w…