Learning Unmasking Policies for Diffusion Language Models

Jason Ramapuram; João Monteiro; Louis Béthune; Marco Cuturi; Metod Jazbec; Michael Kirchhof; Pierre Ablin; Theo X. Olausson; Victor Turrisi

arxiv: 2512.09106 · v4 · pith:UYYAV7RWnew · submitted 2025-12-09 · 💻 cs.LG


Metod Jazbec, Theo X. Olausson, Louis Béthune, Pierre Ablin, Michael Kirchhof, João Monteiro, Victor Turrisi, Jason Ramapuram, Marco Cuturi


classification 💻 cs.LG

keywords: diffusion, performance, sampling, unmasking, block, dllm, dllms, heuristics



Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the sampling procedure that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting.
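To make the confidence-thresholding heuristic mentioned in the abstract concrete, here is a minimal sketch of one way such a sampler could look. It is an illustration only, not the authors' implementation and not the learned policy the paper proposes. The names `MASK`, `confidence_threshold_step`, `sample`, and `toy_model` are hypothetical, and the toy model simply returns fixed probabilities in place of a real dLLM. The fallback that unmasks the single most confident position when nothing clears the threshold is also an assumption, added so the loop always terminates.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id marking a still-masked position


def confidence_threshold_step(tokens, probs, threshold=0.9):
    """Unmask every masked position whose top-1 probability >= threshold.

    tokens: (seq_len,) int array; MASK marks masked positions.
    probs:  (seq_len, vocab) array of per-position token probabilities.
    If no position clears the threshold, the most confident one is
    unmasked anyway (an assumption of this sketch).
    """
    tokens = tokens.copy()
    masked = np.flatnonzero(tokens == MASK)
    if masked.size == 0:
        return tokens
    confidence = probs[masked].max(axis=1)
    prediction = probs[masked].argmax(axis=1)
    chosen = confidence >= threshold
    if not chosen.any():
        chosen[confidence.argmax()] = True
    tokens[masked[chosen]] = prediction[chosen]
    return tokens


def sample(model, seq_len, threshold=0.9):
    """Start fully masked and repeat the step until nothing is masked."""
    tokens = np.full(seq_len, MASK)
    steps = 0
    while (tokens == MASK).any():
        tokens = confidence_threshold_step(tokens, model(tokens), threshold)
        steps += 1
    return tokens, steps


def toy_model(tokens):
    """Stand-in for a dLLM: ignores its input and returns fixed probabilities."""
    return np.array([
        [0.95, 0.03, 0.02],
        [0.50, 0.30, 0.20],
        [0.10, 0.85, 0.05],
        [0.04, 0.04, 0.92],
    ])


tokens, steps = sample(toy_model, seq_len=4)
print(tokens.tolist(), steps)
```

In this toy run two positions clear the threshold in the first step and are unmasked in parallel, while the remaining two are unmasked one at a time through the fallback. The `threshold` value is the kind of hand-tuned parameter the abstract refers to; a learned policy would replace the fixed comparison with a trained mapping from confidences to unmasking decisions.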



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Order Policies for Masked Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

A policy network learns to choose unmasking order in masked diffusion by reweighting the loss, outperforming random and heuristic baselines on ordering-sensitive tasks.

Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 7.0

DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.

Fixed-Point Masked Generative Modeling
cs.LG 2026-05 unverdicted novelty 6.0

FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.

Re-evaluating Confidence Remasking in Masked Diffusion Language Models
cs.LG 2026-06 unverdicted novelty 3.0

Re-evaluation finds post-hoc remasking (WINO) yields little-to-no gain over confidence unmasking in standard dLLM settings and can worsen diversity collapse under stochastic decoding.