pith. sign in

arxiv: 2601.15165 · v4 · pith:MZLXIWWXnew · submitted 2026-01-21 · 💻 cs.CL · cs.AI· cs.LG

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

classification 💻 cs.CL cs.AIcs.LG
keywords dllmsarbitraryflexibilityorderreasoningdiffusioneffectivegeneration
0
0 comments X
read the original abstract

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

    cs.LG 2026-05 conditional novelty 7.0

    TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.

  2. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 conditional novelty 7.0

    DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

  3. LogicDiff: Logic-Guided Denoising Improves Zero-Shot Reasoning in Masked Diffusion Language Models

    cs.CL 2026-03 conditional novelty 7.0

    Logic-role-guided unmasking in masked diffusion models raises zero-shot GSM8K accuracy from 22% to 61% by enforcing logical generation order.

  4. Differences in Text Generated by Diffusion and Autoregressive Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.

  5. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...