Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning

Ruihong Qiu; Yan Jiang; Zi Huang

arxiv: 2605.02263 · v2 · pith:OIEKLWYPnew · submitted 2026-05-04 · 💻 cs.LG

Yan Jiang, Ruihong Qiu, Zi Huang

Pith reviewed 2026-05-08 18:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords diffusion large language models · dynamic reasoning blocks · monotonic entropy descent · reinforcement learning · reasoning coherence · semi-autoregressive generation · block-wise entropy

The pith

Dynamic block sizes selected by monotonic entropy descent and reinforcement learning improve reasoning coherence in diffusion large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fixed-size blocks in dLLMs limit coherent reasoning because the best block length differs across tasks and even within one task, often breaking logical flow. The paper observes that correct reasoning produces steadily descending block-wise entropy while incorrect reasoning shows fluctuations. b1 trains a policy via reinforcement learning to pick block sizes that enforce a monotonic entropy descent, turning this observed pattern into a reward signal. The resulting dynamic blocks integrate directly into existing post-training pipelines and raise performance on reasoning benchmarks over fixed-size baselines.
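The block-wise entropy signal described above can be made concrete with a small sketch. It follows the definition given in the Figure 3 caption (block entropy is the mean token-wise entropy within a block). The function names, the toy two-word vocabulary, and the tolerance parameter are illustrative assumptions, not the authors' code.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def block_entropies(token_probs: List[Sequence[float]],
                    block_sizes: Sequence[int]) -> List[float]:
    """Mean token-wise entropy of each block, in generation order."""
    if any(size <= 0 for size in block_sizes):
        raise ValueError("block sizes must be positive")
    if sum(block_sizes) != len(token_probs):
        raise ValueError("block sizes must cover the sequence exactly")
    result, start = [], 0
    for size in block_sizes:
        block = token_probs[start:start + size]
        result.append(sum(token_entropy(p) for p in block) / size)
        start += size
    return result


def is_monotonic_descent(values: Sequence[float], tol: float = 0.0) -> bool:
    """True when no block's entropy exceeds the previous block's by more than tol."""
    return all(cur <= prev + tol for prev, cur in zip(values, values[1:]))


# Toy two-word vocabulary (hypothetical data): predictions sharpen over the trace.
toy_probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]
toy_entropies = block_entropies(toy_probs, block_sizes=[2, 1, 1])
print(toy_entropies, is_monotonic_descent(toy_entropies))
```

Because the block sizes are an explicit argument, the same token distributions can be re-partitioned to see how a different boundary placement changes the entropy profile.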

Core claim

b1 learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence and integrates seamlessly as a plug-and-play module with existing dLLM post-training algorithms.

What carries the argument

Monotonic Entropy Descent objective with reinforcement learning, which rewards block-size choices that produce consistently descending entropy across blocks.
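One way such a reward could be shaped is sketched below. The abstract does not give the paper's actual reward formula, so this is a stand-in under stated assumptions: `med_reward` scores a trace by the share of consecutive block pairs whose entropy does not rise, and `combined_reward` adds it to a task reward with a weight, loosely mirroring the reward weight varied in Figure 8. Both function names and the additive combination are hypothetical.

```python
from typing import Sequence


def med_reward(block_entropy: Sequence[float], margin: float = 0.0) -> float:
    """Hypothetical reward in [0, 1]: fraction of consecutive block pairs
    whose entropy falls by at least `margin` (ties count when margin is 0)."""
    if len(block_entropy) < 2:
        return 1.0  # a single block cannot violate monotonicity
    pairs = list(zip(block_entropy, block_entropy[1:]))
    descents = sum(1 for prev, cur in pairs if cur <= prev - margin)
    return descents / len(pairs)


def combined_reward(task_reward: float,
                    block_entropy: Sequence[float],
                    weight: float = 1.0) -> float:
    """Assumed additive combination of a task reward and the entropy reward."""
    return task_reward + weight * med_reward(block_entropy)


steady = [2.0, 1.5, 1.0, 0.2]      # entropy falls at every block boundary
wobbling = [1.0, 1.4, 0.9, 1.2]    # entropy rises twice, falls once
print(med_reward(steady), med_reward(wobbling))
```

A pairwise fraction is only one choice; a rank-correlation score or a penalty proportional to the size of each rise would reward the same pattern with different gradients.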

If this is right

Block sizes adapt automatically to task-specific and intra-task needs instead of relying on a single fixed length.
Logical flow stays intact because block boundaries are chosen only when entropy continues to descend.
The method adds to any existing dLLM post-training routine without altering the base model or training loop.
Benchmark scores rise consistently when the entropy-guided policy replaces fixed blocks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entropy signal might be used to early-stop generation on paths that begin to fluctuate, saving compute.
The approach could transfer to other block-wise or semi-autoregressive generators beyond diffusion LLMs.
Explicit tests on tasks with very long reasoning chains would show whether the policy scales without extra tuning.
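The early-stopping extension in the first point could look roughly like the following. This is an editorial sketch, not something the paper proposes: the `patience` rule (abandon a trace after a fixed number of consecutive entropy rises) and the function name are assumptions chosen for illustration.

```python
from typing import Sequence


def should_stop_early(block_entropy: Sequence[float],
                      patience: int = 2,
                      tol: float = 0.0) -> bool:
    """Return True once block entropy has risen `patience` times in a row."""
    rises = 0
    for prev, cur in zip(block_entropy, block_entropy[1:]):
        # Count consecutive rises; any non-rise resets the counter.
        rises = rises + 1 if cur > prev + tol else 0
        if rises >= patience:
            return True
    return False


print(should_stop_early([2.0, 1.5, 1.7, 1.9]))  # two rises in a row
print(should_stop_early([2.0, 1.5, 1.7, 1.2]))  # one rise, then recovery
```

In a generation loop the check would run after each completed block on the entropies observed so far, so a fluctuating path is cut before further blocks are decoded.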

Load-bearing premise

The observed block-wise entropy trend reliably signals whether reasoning is correct and supplies a stable reward signal for RL without destabilizing generation.

What would settle it

An experiment in which dynamic blocks chosen by the entropy signal produce no accuracy gain on reasoning benchmarks, or in which entropy trends fail to separate correct from incorrect outputs.

Figures

Figures reproduced from arXiv: 2605.02263 by Ruihong Qiu, Yan Jiang, Zi Huang.

**Figure 1.** Analysis of block size against reasoning performance in dLLMs. The x-axis indicates block size for generation. Results demonstrate that the optimal block size varies significantly across different reasoning benchmarks.

**Figure 2.** Token-wise entropy comparison between incorrect and correct reasoning with fixed-size blocks by wd1 on the Countdown reasoning benchmark. (a) Rigid boundaries disrupt numerical operations (e.g., splitting “71 − 66” between Block 3 and 4), causing high-entropy anomalies and incorrect results. (b) Coherent reasoning is maintained when block boundaries do not interrupt calculations, reflected by a low and des…

**Figure 3.** Block-wise entropy evolution for LLaDA, d1, and wd1 by their default block size of 32 on Math500. Red and green lines denote block generations for incorrect and correct reasoning results, respectively. Shaded areas represent standard deviation. The block entropy is calculated as the mean token-wise entropy within the same block. While correct reasoning generations exhibit a consistent descent in block entr…

**Figure 4.** Illustration of (a) conventional fixed-size block generation, and (b) our proposed dynamic-size reasoning blocks, for dLLMs. Fixed-size generation enforces rigid sequence partitioning, which results in the generation of nondeterministic tokens with high entropy and yields incorrect results. In contrast, dynamic-size reasoning blocks align blocks with each reasoning step flexibly, avoiding the disruption o…

**Figure 5.** Analysis of MED with dLLMs’ reasoning performance. All reasoning tasks are ranked and grouped according to their rSCC. Results demonstrate that reasoning generations with a higher rSCC (a monotonic descending trend in block entropy) achieve higher accuracy.

**Figure 6.** Correlation between rSCC improvement and error reduction on hard reasoning samples. For incorrectly predicted samples in the fixed-size baseline, b1 demonstrates a unique capability to consistently increase rSCC, thereby facilitating a monotonic entropy descent trend and reducing the reasoning errors.

**Figure 7.** Generation and token entropy of wd1 and our b1 with the same prompt on Countdown. While fixed-size blocks in wd1 disrupt reasoning steps and repeatedly generate high-entropy trivial tokens (e.g., full stops), our strategy dynamically aligns boundaries with reasoning steps. This ensures a coherent reasoning trace, leading to correct results. Special tokens including \block and <|endoftext|> are hidden for imp…

**Figure 8.** Hyperparameter Sensitivity Analysis on MATH500. The grey line represents the baseline wd1 performance by fixed-size blocks, while the red line denotes wd1 + b1 across varying reward weights. Results indicate that b1 is robust to hyperparameter variations and consistently outperforms the fixed-size baseline. Furthermore, the performance gain at weight 1.0 compared to 0 confirms the contribution of the propo…

**Figure 9.** Case studies on the GSM8K dataset compare dLLMs post-trained via wd1 using default fixed-size blocks against proposed dynamic reasoning block generation strategies, evaluated on the “John’s driving” problem (Ground Truth: 45). The number behind each block indicates the block size. Fixed-size models rigidly partition reasoning logic and truncate mathematical calculations. Such fragmentation of the reasoning…

**Figure 10.** Training Reward Dynamics in b1. The curves represent the sum of the proposed block entropy and block ending indicator rewards while being smoothed using a Time-Weighted Exponential Moving Average to highlight underlying trends. Results verify that the newly introduced rewards in b1 converge mostly within the first 1000 gradient steps during training.

**Figure 11.** Comparison of Reward Dynamics between b1, d1 and wd1 on GSM8k. In contrast to…

Original abstract

Recent diffusion large language models (dLLMs) have demonstrated both effectiveness and efficiency in reasoning via a block-based semi-autoregressive generation paradigm. Despite their progress, the fixed-size block generations remain a critical bottleneck for effective and coherent reasoning. 1. From a global perspective, different reasoning tasks would correspond to different optimal decoding block sizes, which makes a “one-size-fits-all” assumption ineffective. 2. Even within a single reasoning task, the rigid block partitioning would break the logical flow and reduce reasoning coherence. Through empirical observations, we reveal that for block-wise entropy, incorrect reasoning exhibits a fluctuating and unsteady trend between blocks, whereas the correctly generated tasks follow a consistent descending trend. Therefore, this paper proposes b1, a novel post-training framework for dLLMs that learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence. b1 integrates seamlessly as a plug-and-play module with existing dLLM's post-training algorithms. Extensive experiments across various reasoning benchmarks showcase b1's consistent improvement over existing fixed-size block baselines. Our code has been released at https://github.com/YanJiangJerry/Block-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper turns an observed entropy pattern in correct dLLM blocks into an RL reward for dynamic sizing, but the causal payoff for reasoning quality remains unproven.

The main contribution is b1, a post-training add-on that lets diffusion LLMs pick variable block lengths during reasoning by training a policy to produce monotonically descending entropy across blocks. The authors start from the observation that correct traces show a steady entropy drop while incorrect ones wobble, then use that pattern as the RL objective. The plug-and-play framing with existing dLLM fine-tuning is straightforward, and releasing the code on GitHub is the right move for anyone who wants to test it quickly. Experiments are said to show gains over fixed-size baselines on reasoning tasks, which at least gives a concrete signal to follow up on. The entropy-based reward is grounded in an observation about the generations rather than being pulled from the policy itself, so it avoids the worst kind of circularity.

That said, the link between forcing entropy descent and better logical coherence is still only correlational. Nothing in the abstract explains why monotonic entropy must produce fewer reasoning breaks, as opposed to the model simply learning to match the entropy shape while keeping the same underlying errors. Without details on the RL formulation, variance controls, or whether the gains survive stronger baselines and statistical checks, it is easy to imagine the improvements coming from diffusion artifacts or reward hacking rather than genuine coherence fixes. The concern about missing causality stands: the paper does not demonstrate that optimizing the observed trend improves the output distribution in ways that matter for downstream correctness.

This work is aimed at groups already running dLLM post-training and looking for adaptive generation knobs. A reader who cares about practical efficiency tweaks in block-based diffusion models could pull the code and run the ablations themselves.
It is worth sending to peer review because the problem is real, the method is implementable, and the code release lets referees check the claims directly, even if the current evidence is preliminary and the mechanism needs tighter justification.

Referee Report

3 major / 1 minor

Summary. The paper claims that fixed block sizes in diffusion LLMs (dLLMs) limit reasoning coherence because optimal sizes vary across tasks and within tasks; it reports an empirical observation that correct block-wise generations exhibit monotonically descending entropy while incorrect ones fluctuate, and proposes b1, a plug-and-play RL post-training module that optimizes a Monotonic Entropy Descent objective to select dynamic block sizes, yielding consistent gains over fixed-size baselines on reasoning benchmarks.

Significance. If the empirical gains are robust and the entropy-based reward is shown to be causal rather than correlational, the work would offer a practical way to relax the rigid block partitioning in dLLMs without retraining the base model, addressing a clear bottleneck in semi-autoregressive diffusion generation. The public code release supports reproducibility.

major comments (3)

[Abstract] Abstract and empirical-observation paragraph: the central claim that monotonic entropy descent provides a reliable, stable reward signal for RL to improve reasoning coherence rests on an observed correlation; no formal derivation, causal analysis, or ablation is supplied to show that optimizing this objective produces logical improvements rather than reward hacking or diffusion-process artifacts that merely match the entropy pattern.
[Abstract] Abstract, experiments paragraph: the manuscript asserts 'extensive experiments' and 'consistent improvement' yet supplies no implementation details, exact metrics, statistical controls, number of runs, or comparison tables in the provided text, making it impossible to assess whether the data support the plug-and-play and dynamic-size claims.
The plug-and-play integration claim requires evidence that the RL module leaves the underlying dLLM sampling distribution unchanged except for block-size selection; without measurements of new inconsistencies or distribution shifts, the coherence gains cannot be isolated from potential side effects.

minor comments (1)

[Abstract] The abstract uses numbered points 1. and 2. but does not clearly separate the global vs. intra-task arguments; a short reorganization would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the presentation of our empirical findings and claims. We address each major comment point by point below, indicating planned revisions to the manuscript.

Referee: [Abstract] Abstract and empirical-observation paragraph: the central claim that monotonic entropy descent provides a reliable, stable reward signal for RL to improve reasoning coherence rests on an observed correlation; no formal derivation, causal analysis, or ablation is supplied to show that optimizing this objective produces logical improvements rather than reward hacking or diffusion-process artifacts that merely match the entropy pattern.

Authors: The manuscript grounds the monotonic entropy descent property in empirical observations across reasoning tasks rather than a formal derivation. We will add ablation experiments in the revised version, including comparisons to non-monotonic entropy rewards, random block-size policies, and entropy-agnostic RL baselines, to isolate the objective's contribution and reduce the possibility of reward hacking or process artifacts. A complete theoretical derivation of why correct block-wise generations exhibit monotonic descent lies outside the current scope and would require new analysis of diffusion dynamics in LLMs; we will instead expand the discussion to link the observation to information-theoretic accumulation in coherent reasoning chains. (revision: partial)
Referee: [Abstract] Abstract, experiments paragraph: the manuscript asserts 'extensive experiments' and 'consistent improvement' yet supplies no implementation details, exact metrics, statistical controls, number of runs, or comparison tables in the provided text, making it impossible to assess whether the data support the plug-and-play and dynamic-size claims.

Authors: The full manuscript contains Section 4 (Experiments) with implementation details (RL hyperparameters, base dLLM models), exact metrics (accuracy on GSM8K, MATH, and coherence measures), statistical controls (5 independent seeds with standard deviations and significance tests), and full comparison tables versus fixed-size baselines. The abstract was intentionally concise; we will revise it to include key quantitative results and explicit references to the experiments section so that the claims are self-contained and verifiable from the text. (revision: yes)
Referee: [—] The plug-and-play integration claim requires evidence that the RL module leaves the underlying dLLM sampling distribution unchanged except for block-size selection; without measurements of new inconsistencies or distribution shifts, the coherence gains cannot be isolated from potential side effects.

Authors: We will add quantitative evidence in the revision by reporting KL divergence between the base dLLM token distribution and the b1-augmented distribution over sampled sequences, along with any observed generation inconsistencies. Because b1 acts solely as a post-training block-size selector during diffusion sampling and does not alter the underlying denoising network, we expect limited distribution shift; the new measurements will allow readers to isolate the coherence gains from side effects. (revision: yes)
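The distribution-shift measurement promised in the last response could be approximated as follows. This is a minimal sketch, assuming per-position token distributions from the base model and from the b1-augmented model are available for the same sampled sequence; the function names, the `eps` floor, and the toy numbers are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    if len(p) != len(q):
        raise ValueError("distributions must share a vocabulary")
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def mean_token_kl(base_dists: List[Sequence[float]],
                  tuned_dists: List[Sequence[float]]) -> float:
    """Average per-position KL between base and tuned token distributions."""
    if not base_dists or len(base_dists) != len(tuned_dists):
        raise ValueError("need two non-empty, equally long sequences")
    total = sum(kl_divergence(p, q) for p, q in zip(base_dists, tuned_dists))
    return total / len(base_dists)


# Toy example: the second position shifts, the first does not.
base = [[0.5, 0.5], [0.5, 0.5]]
tuned = [[0.5, 0.5], [0.9, 0.1]]
print(mean_token_kl(base, tuned))
```

Reporting this number alongside accuracy would show whether the gains come with a small or a large departure from the base model's token distribution.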

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in external empirical observation

The paper's core chain begins with an empirical observation (block-wise entropy descends monotonically for correct reasoning but fluctuates for incorrect), which is presented as data-driven rather than derived from the model. It then constructs a Monotonic Entropy Descent objective to serve as the RL reward for learning dynamic block sizes. This objective is not self-definitional (the reward is not defined in terms of the policy outputs it optimizes), nor is any 'prediction' of coherence gains equivalent to the input by construction. Experiments on external benchmarks provide the claimed validation, and no load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked in the provided text to force the result. The approach remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that entropy trends differ between correct and incorrect reasoning; no explicit free parameters, additional axioms, or invented entities are described in the abstract.

axioms (1)

domain assumption: Block-wise entropy exhibits a consistent descending trend for correct reasoning and a fluctuating trend for incorrect reasoning.
This is the key empirical observation stated in the abstract that motivates the monotonic entropy descent objective.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost (Jcost) · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: b1 introduces Monotonic Entropy Descent (MED) via a block entropy reward, facilitating block generations with consistent descending entropy.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.