Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

· 2026 · cs.CL · arXiv 2601.15593

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Masked Diffusion Language Models (MDLMs) promise parallel token generation and arbitrary-order decoding, yet it remains unclear to what extent current models truly realize these capabilities. We characterize MDLM behavior along two dimensions -- parallelism strength and generation order -- using Average Finalization Parallelism (AFP) and Kendall's tau. We evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming. The results show that MDLMs still lag behind comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies. Meanwhile, MDLMs exhibit adaptive decoding behavior: their parallelism and generation order vary significantly with the task domain, the stage of reasoning, and whether the output is correct. On tasks that require "backward information" (e.g., Sudoku), MDLMs adopt a solution order that tends to fill easier Sudoku blanks first, highlighting their advantages. Finally, we provide theoretical motivation and design insights supporting a Generate-then-Edit paradigm, which mitigates dependency loss while retaining the efficiency of parallel decoding.

representative citing papers

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.

Locally Coherent Parallel Decoding in Diffusion Language Models

cs.CL · 2026-03-03 · unverdicted · novelty 7.0

CoDiLA adds a compact auxiliary AR model on diffusion latents to enforce local sequential validity during parallel token sampling in discrete diffusion language models.

citing papers explorer

Showing 2 of 2 citing papers.

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding cs.CL · 2026-05-15 · unverdicted · none · ref 20 · internal anchor
PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.
Locally Coherent Parallel Decoding in Diffusion Language Models cs.CL · 2026-03-03 · unverdicted · none · ref 16 · internal anchor
CoDiLA adds a compact auxiliary AR model on diffusion latents to enforce local sequential validity during parallel token sampling in discrete diffusion language models.

Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

fields

years

verdicts

representative citing papers

citing papers explorer