Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training

David Alvarez-Melis; Jaeyeon Kim; Jonathan Geuter; Sham Kakade; Sitan Chen

arxiv: 2602.10314 · v2 · pith:M6WYU7X6new · submitted 2026-02-10 · 💻 cs.LG

keywords training, masking, masks, puma, unmasking, diffusion, inference-time, masked

Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train-test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by $\approx 2.5\times$ and offers complementary advantages on top of common recipes like autoregressive initialization. We open-source our codebase at https://github.com/JaeyeonKim01/PUMA.
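
To make the mismatch described in the abstract concrete, the following is a minimal toy sketch in Python. It is not the PUMA procedure, whose details are in the paper and codebase. It only contrasts independent random masking, a common form of the MDM forward process, with the small set of masks that a greedy unmasking schedule visits. The function names, the fixed `scores` list (a stand-in for model confidence), and the `per_step` parameter are all illustrative assumptions.

```python
import random


def random_mask(seq_len, rng):
    """Training-style masking (common MDM form): draw a mask ratio t ~ U(0, 1),
    then mask each position independently with probability t.
    True marks a masked position."""
    t = rng.random()
    return [rng.random() < t for _ in range(seq_len)]


def unmasking_trajectory(scores, per_step):
    """Toy inference-style schedule: start fully masked and, at each step,
    reveal the `per_step` masked positions with the highest score.
    `scores` is a hypothetical, fixed stand-in for model confidence;
    a real sampler would typically recompute it after every step."""
    masked = [True] * len(scores)
    trajectory = [tuple(masked)]
    while any(masked):
        candidates = [i for i, m in enumerate(masked) if m]
        candidates.sort(key=lambda i: scores[i], reverse=True)
        for i in candidates[:per_step]:
            masked[i] = False
        trajectory.append(tuple(masked))
    return trajectory


rng = random.Random(0)
seq_len = 8
scores = [rng.random() for _ in range(seq_len)]

trajectory = unmasking_trajectory(scores, per_step=2)
num_possible_masks = 2 ** seq_len  # every subset of positions is a valid random mask

print(len(trajectory), "masks visited out of", num_possible_masks)
print("one random training mask:", random_mask(seq_len, rng))
```

With 8 positions and 2 reveals per step, the schedule visits 5 masks, while random masking can produce any of the 256 subsets. This gap is the train-test mismatch the abstract refers to, shown here at toy scale.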

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
cs.LG 2026-05 unverdicted novelty 7.0

Learned Relay Representations enable masked diffusion models to propagate useful latent information across denoising steps, scaling to Fast-dLLM v2 to outperform supervised finetuning on coding tasks while cutting inf...

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
cs.LG 2026-06 unverdicted novelty 6.0

MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.

Fixed-Point Masked Generative Modeling
cs.LG 2026-05 unverdicted novelty 6.0

FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models
cs.AI 2026-05 unverdicted novelty 6.0

Confidence-based decoding and training in masked diffusion models shortcut long-range dependencies in reasoning, producing errors on complex inputs that random masking avoids.