d2: Improving Reasoning in Diffusion Language Models via Trajectory Likelihood Estimation

Gilad Turok; Guanghan Wang; Marianne Arriola; Volodymyr Kuleshov; Yair Schiff

arxiv: 2509.21474 · v4 · pith:4EFBXV5Qnew · submitted 2025-09-25 · 💻 cs.LG

d2: Improving Reasoning in Diffusion Language Models via Trajectory Likelihood Estimation

Guanghan Wang , Gilad Turok , Yair Schiff , Marianne Arriola , Volodymyr Kuleshov This is my paper

classification 💻 cs.LG

keywords dlmsreasoningtrajectorydiffusionlikelihoodmaskedmodelsalgorithm

0 comments

read the original abstract

While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on accurate estimates of the sampling trajectory likelihoods. Because computing these likelihoods naively is computationally expensive for masked DLMs, we develop a family of estimators tailored to distinct model classes. For DLMs that support a sampling algorithm called any-order decoding, we propose d2-AnyOrder, which achieves exact trajectory likelihood with a single model pass. Through an empirical study of widely used DLMs, we show that any-order decoding is not universally supported in practice. For standard masked diffusion models, we propose d2-StepMerge, which approximates the trajectory likelihood, trading off compute for approximation accuracy in an analytically tractable manner. Empirically, d2 significantly outperforms widely-used RL baselines when applied to popular DLMs, and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500). We provide the code along with a blog post on the project page: https://guanghanwang.com/d2

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learnability-Informed Fine-Tuning of Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
cs.RO 2026-05 unverdicted novelty 7.0

ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
Discrete Tilt Matching
cs.LG 2026-04 unverdicted novelty 7.0

DTM recasts dLLM fine-tuning as weighted cross-entropy matching of tilted local posteriors, with demonstrated gains on Sudoku and math tasks.
Discrete Tilt Matching
cs.LG 2026-04 unverdicted novelty 7.0

Discrete Tilt Matching recasts dLLM fine-tuning as state-level matching of tilted local unmasking posteriors, producing a stable weighted cross-entropy loss that improves Sudoku and Countdown performance when applied ...
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
cs.LG 2026-05 unverdicted novelty 6.0

dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
cs.RO 2026-05 unverdicted novelty 6.0

ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

b1 trains dLLMs to dynamically select reasoning block sizes via monotonic entropy descent with RL, improving coherence over fixed-size baselines on reasoning benchmarks.