Locally Coherent Parallel Decoding in Diffusion Language Models

Abbas Rahimi; Michael Hersche; Nicolas Menet; Ronan Tanios

arxiv: 2603.20216 · v2 · pith:OFIGOSNYnew · submitted 2026-03-03 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-21 11:59 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords diffusion language models · parallel decoding · autoregressive models · local coherence · code generation · hybrid decoding

The pith

CoDiLA uses a compact auxiliary autoregressive model to ensure local coherence during parallel token sampling in diffusion language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion language models generate multiple tokens at once to achieve sub-linear latency and support bidirectional context, yet independent sampling from marginal distributions produces syntactic inconsistencies and broken multi-token structures. The work introduces CoDiLA to reconcile parallel sampling with local dependency modeling by delegating fine-grained decoding to a small auxiliary autoregressive model that acts on the diffusion latents inside each block. The main diffusion model retains its bidirectional strength across blocks, while the auxiliary component enforces sequential validity locally. Experiments on code generation show that an auxiliary model of only 0.6 billion parameters removes coherence artifacts and improves the accuracy-speed trade-off.

Core claim

Rather than forcing the diffusion language model to resolve fine-grained syntax on its own, CoDiLA delegates local decoding to a small auxiliary autoregressive model operating on the diffusion latents. This design reconciles parallel sampling with local dependency modeling, allowing sequential validity within a block while preserving the core bidirectional modeling capabilities of the main model across blocks.

What carries the argument

CoDiLA, which integrates a compact auxiliary autoregressive model on diffusion latents to enforce local sequential validity during parallel block generation.

Load-bearing premise

That a small auxiliary autoregressive model can capture the fine-grained syntax and multi-token structures needed for local coherence.

What would settle it

Run the method on code-generation prompts and measure whether the rate of syntactic errors or broken multi-token structures drops below the rate observed in standard independent diffusion sampling.

Figures

Figures reproduced from arXiv: 2603.20216 by Abbas Rahimi, Michael Hersche, Nicolas Menet, Ronan Tanios.

**Figure 1.** Our CoDiLA in action. a) An example of incoherent text generated by Dream-Coder-Instruct-7B in the first iteration. Due to independent modeling of marginal distributions, it predicts the incoherent token “problem” (Top-1). b) This work enforces local coherence using a block-wise AR model conditioned on soft local tokens. In this example, it recovers coherence by retrieving the correct token “(list” from th…

**Figure 2.** CoDiLA with a block size of $B = 4$. This example depicts the prediction of the first block ($b^1$). First, the DLM computes the token-wise conditional marginal probability vectors ($\pi_t^j$). Next, we perform soft-conditioning by computing the expected embedding ($e_t^j$) over the AR model’s embedding matrix ($E_\phi$), weighted by these marginals. Finally, the AR model receives these soft tokens, encapsulated by <t…

**Figure 3.** Larger block sizes (B) reduce the training loss. We compute the average perplexity weighted by the masking ratio (see Equation (2)), and display the moving average over 10 samples. The forward process always masks blocks of 8 contiguous tokens.

… Ling Team, 2025), the same SFT dataset that was used for the DLM. We finetune a separate AR model for each block size for 32k steps, while keeping the DLM frozen. W…

**Figure 4.** Inference with static parallelism. We report Pass@1 (%) vs. throughput (tokens/sec, batch size 1) on a single NVIDIA A100-80GB GPU. We compare the base DLM (Xie et al., 2025), ADJUST (Bansal & Sanghavi, 2025), and our CoDiLA, all built on Dream-Coder-Instruct-7B. Parallelism is controlled by unmasking a fixed number of tokens per iteration. CoDiLA consistently achieves higher accuracy at equivalent thro…

**Figure 5.** Inference with dynamic parallelism. We operate a dynamic CoDiLA (B = 4) with different entropy thresholds (τ).

… better accuracy-throughput behavior than a small block-size (B = 2) with static sampling.

4.5. Ablation: Soft vs. Hard Conditioning

We ablate the effectiveness of our soft-conditioning by comparing against hard-conditioning. We train a variant of the AR model that conditions only on the hard top…

Original abstract

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel generation while ensuring sequential validity within a block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CoDiLA uses a compact auxiliary AR model on diffusion latents to handle local coherence in parallel DLM decoding, but the abstract gives no numbers or ablations to back the claims.

Colleague,

The one or two things to know about this paper: it proposes CoDiLA, a method that uses a highly compact auxiliary autoregressive model operating on diffusion latents to fix local coherence problems during parallel token sampling in diffusion language models, and it positions this as a way to improve code generation by reducing syntactic inconsistencies while keeping sub-linear latency.

What is actually new is the concrete mechanism of delegating local decoding to this auxiliary AR model on the latents, which appears different from standard independent sampling and from other hybrid attempts. The paper does well in identifying the core issue, namely that DLMs fail to capture joint dependencies among concurrently generated tokens and therefore produce broken structures, and in suggesting a design that allows parallel generation with sequential validity within blocks and bidirectional modeling across them. It also gives due credit to the appeal of DLMs for code generation and editing, which stems from their bidirectional capabilities.

The soft spots: the claims that the auxiliary model effectively eliminates coherence artifacts and sets a new Pareto frontier for accuracy and speed are presented in the abstract without supporting quantitative results, baselines, error bars, or ablation details. This leaves the strength of the empirical outcome unclear at this stage. Additionally, while the paper asserts that the design maintains the main DLM's capabilities, it shows no derivation or evidence on whether the local AR corrections affect cross-block attention or the overall receptive field. The stress-test concern, that sequential decisions inside a block could propagate constraints that reduce the bidirectional context, seems plausible if the models share latents closely, and it would be important to see whether the full paper addresses it with specific checks.

This paper is for colleagues working on diffusion-based language models and efficient parallel decoding techniques, especially in applications like code generation where both speed and correctness matter. A reader focused on practical improvements to generative models would get value from the idea, even if the details need fleshing out. It has the makings of something that deserves a serious referee to evaluate the experiments and verify the claims.

Recommendation: Yes, engage with it through peer review to get a proper assessment of the results.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoDiLA, a hybrid method for diffusion language models that delegates local token dependencies and syntactic coherence to a compact auxiliary autoregressive model (e.g., 0.6B parameters) operating on diffusion latents. This enables parallel block-wise sampling while aiming to preserve the main DLM's bidirectional context modeling across blocks, with the central empirical claim being that the approach eliminates coherence artifacts and achieves a new accuracy-speed Pareto frontier on code generation tasks.

Significance. If the separation of local AR correction from global bidirectional diffusion modeling holds without side effects on receptive fields, the result would be a pragmatic advance for making DLMs competitive in latency-sensitive applications such as code generation and editing. The use of a highly compact auxiliary model is a strength that could generalize beyond the reported benchmarks.

major comments (2)

§3 (Method, CoDiLA description): the assertion that delegating local decoding to the auxiliary AR model 'maintains core DLM capabilities, including bidirectional modeling across blocks' lacks any derivation, attention-map analysis, or ablation demonstrating that sequential decisions inside a block do not propagate constraints that shrink the effective cross-block receptive field of the diffusion backbone.
§4.2 (Experiments, Pareto-frontier results): the claim that the 0.6B auxiliary model 'effectively eliminates coherence artifacts' and establishes a new frontier is presented without reported error bars, full baseline tables, or controls that isolate latent-update effects from the auxiliary model versus the unmodified DLM; this is load-bearing for the central empirical contribution.

minor comments (2)

Abstract: the specific code-generation benchmarks (e.g., HumanEval, MBPP) supporting the Pareto-frontier claim should be named explicitly.
Notation: the precise interface between diffusion latents and the auxiliary AR conditioning (update vs. read-only) would benefit from a small diagram or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our presentation. Below we address each major comment in turn.

Referee: §3 (Method, CoDiLA description): the assertion that delegating local decoding to the auxiliary AR model 'maintains core DLM capabilities, including bidirectional modeling across blocks' lacks any derivation, attention-map analysis, or ablation demonstrating that sequential decisions inside a block do not propagate constraints that shrink the effective cross-block receptive field of the diffusion backbone.

Authors: We agree that the original manuscript would benefit from a more explicit justification of this architectural property. The method description relies on the separation of concerns (the auxiliary AR model is applied only to intra-block latents after the diffusion backbone has produced them, while cross-block conditioning remains the responsibility of the DLM), but we did not supply a formal argument or supporting analysis. In the revised version we add a short derivation in §3.2 showing that intra-block sequential decisions cannot retroactively alter the DLM’s attention over prior blocks, together with an ablation that measures cross-block attention entropy and a gradient-based receptive-field probe confirming no measurable shrinkage.

Revision: yes

Referee: §4.2 (Experiments, Pareto-frontier results): the claim that the 0.6B auxiliary model 'effectively eliminates coherence artifacts' and establishes a new frontier is presented without reported error bars, full baseline tables, or controls that isolate latent-update effects from the auxiliary model versus the unmodified DLM; this is load-bearing for the central empirical contribution.

Authors: We concur that statistical reporting and isolation of the auxiliary model’s contribution are necessary to support the central empirical claim. The submitted version contained only single-run point estimates. In the revision we have repeated all main experiments across five random seeds, added error bars to the Pareto plots and tables, expanded the baseline comparison table, and introduced a control condition that performs identical latent updates without the auxiliary AR head. The new results show that the coherence improvement is attributable to the auxiliary model rather than the latent-update procedure alone.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method validated externally

The paper presents CoDiLA as an architectural design choice that delegates local syntax to a compact auxiliary AR model on diffusion latents while claiming to preserve the main DLM's bidirectional capabilities across blocks. All central assertions—elimination of coherence artifacts, parallel generation with sequential validity, and new Pareto frontier on code benchmarks—are framed as empirical demonstrations rather than mathematical derivations or predictions. No equations, fitted parameters renamed as outputs, or self-citation chains appear in the provided text that would reduce the claimed improvements to tautological inputs by construction. The result is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the design choice of a compact auxiliary model whose size and training are not derived from first principles but selected to balance coherence and speed.

free parameters (1)

auxiliary AR model size
Explicitly given as 0.6B parameters in the abstract; this is a hand-chosen design parameter to keep overhead low.

pith-pipeline@v0.9.0 · 5730 in / 1168 out tokens · 37848 ms · 2026-05-21T11:59:14.199592+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Theorem 3.2 ... smallest possible NELBO is $B_B := H[x_0] + \sum$ ... total correlation across blocks

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

soft-conditioning ... $e_t^j = \sum_v [\pi_t^j]_v \cdot E_\phi(v)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

[2]

URL https://openreview.net/forum?id=bFJ8Sdr224. Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y., Huang, Z., Lan, Z., Li, C., Li, C., Li, J., Li, Z., Liu, H., Liu, L., Lu, G., Lu, X., Ma, Y., Tan, J., Wei, L., Wen, J.-R., Xing, Y., Zhang, X., Zhao, J., Zheng, D., Zhou, J., Zhou, J., Zhou, Z., Zhu, L., and Zhuang, Y. LLaDA2.0: Scaling Up...

[3]

URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. Campbell, A., Benton, J., and Bortoli, V. D. A Continuous Time Framework for Discrete Denoising Models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022. URL https://openreview.net/pdf?id=DmT862YAieY. Campbell, ...

[4]

Evaluating Large Language Models Trained on Code

URL https://openreview.net/forum?id=ogMTEtHO6M. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet,...

[5]

URL https://openreview.net/forum?id=F1AUXqDLuh. Hersche, M., Moor-Smith, S., Hofmann, T., and Rahimi, A. Soft-Masked Diffusion Language Models. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=Gba02UMvrG. Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models...

[6]

S., Seo, J.-s., Zhang, Z., and Gupta, U

doi: 10.48550/arXiv.2505.21467. URL https://openreview.net/forum?id=KUfKvlX3VY. Huang, F., Tao, T., Zhou, H., Li, L., and Huang, M. On the learning of non-autoregressive transformers. In International Conference on Machine Learning (ICML), volume

[7]

Mercury: Ultra-Fast Language Models Based on Diffusion

PMLR, 2022. URL https://proceedings.mlr.press/v162/huang22k.html. Inception, L., Khanna, S., Kharbanda, S., Li, S., Varma, H., Wang, E., Birnbaum, S., Luo, Z., Miraoui, Y., Palrecha, A., Ermon, S., Grover, A., and Kuleshov, V. Mercury: Ultra-Fast Language Models Based on Diffusion. arXiv preprint arXiv:2506.17298, 2025. doi: 10.48550/arXiv.2506.17298. ...

[8]

URL https://openreview.net/forum?id=OsZr5T7Cd0. Kim, J., Kim, S., Lee, T., Pan, D. Z., Kim, H., Kakade, S., and Chen, S. Fine-Tuning Masked Diffusion for Provable Self-Correction. arXiv preprint arXiv:2510.01384, October 2025. doi: 10.48550/arXiv.2510.01384. URL http://arxiv.org/abs/2510.01384. Kong, F., Zhang, J., Liu, Y., Wu, Z., Tian, Y., W, V., a...

[9]

URL https://openreview.net/forum?id=1qvx610Cu7. Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y., Singh, V., Kautz, J., Zhang, C., and Molchanov, P. TiDAR: Think in Diffusion, Talk in Autoregression. arXiv preprint arXiv:2511.08923, November 2025b. doi: 10.48550/arXiv.2511.08923. URL http://arxiv.org/abs/2511.08923. Liu, Z., Yang, Y., Zhang, Y., Chen, J...

[10]

doi: 10.48550/arXiv.2510.08369. URL http://arxiv.org/abs/2510.08369. Ni, J., Liu, Q., Dou, L., Du, C., Wang, Z., Yan, H., Pang, T., and Shieh, M. Q. Diffusion Language Models are Super Data Learners. arXiv preprint arXiv:2511.03276, November 2025a. doi: 10.48550/arXiv.2511.03276. URL http://arxiv.org/abs/2511.03276. Ni, J., Liu, Q., Du, C., Dou, L., Yan, ...

[11]

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

URL https://openreview.net/forum?id=GDYaNzxt9T. Sahoo, S. S., Arriola, M., and Schiff, Y. Simple and Effective Masked Diffusion Language Models. In Advances in Neural Information Processing Systems (NeurIPS), volume 38, 2024. URL https://openreview.net/forum?id=L4uaAR4ArM. Shi, J., H...

[12]

URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Wang, G., Schiff, Y., Sahoo, S. S., and Kuleshov, V. Remasking Discrete Diffusion Models with Inference-Time Scaling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS),

[13]

URL https://openreview.net/forum?id=IJryQAOy0p. Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z. Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing. In The Fourteenth International Conference on Learning Representations (ICLR),

[14]

K., Garcia Lambas, D., Ruiz, A

URL https://openreview.net/forum?id=t5uLZSRjhF. Wei, Q., Zhang, Y., Liu, Z., Liu, D., and Zhang, L. Accelerating Diffusion Large Language Models with Slow-Fast Sampling: The Three Golden Principles. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. doi: 10.48550/arXiv.2506.10848. URL https://openreview.net/forum? ...

[15]

and Zhang, J

doi: 10.48550/arXiv.2510.00294. URL http://arxiv.org/abs/2510.00294. Xie, Z., Ye, J., Zheng, L., Gao, J., Dong, J., Wu, Z., Zhao, X., Gong, S., Jiang, X., Li, Z., and Kong, L. Dream-Coder 7B: An Open Diffusion Language Model for Code. arXiv preprint arXiv:2509.01142, September 2025. doi: 10.48550/arXiv.2509.01142. URL http://arxiv.org/abs/2509.01142. Ya...

[16]

Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

URL https://aclanthology.org/2025.emnlp-main.1597/. Zhang, S., Peng, F. Z., Zhang, Y., Pan, J., and Chrysos, G. G. Corrective Diffusion Language Models. arXiv preprint arXiv:2512.15596, December 2025b. doi: 10.48550/arXiv.2512.15596. URL http://arxiv.org/abs/2512.15596. Zhong, Y., Gu, Y., Zang, Z., Li, X., Ding, Y., Jia, X., Shen, Y., Lan, Z., Zhu,...

[17]

Training roles. The auxiliary AR model is Qwen3-0.6B (finetuned end-to-end), while the DLM (Dream-Coder-Instruct-7B) is kept frozen.

[18]

Tokenizer/interface alignment. We use the Dream-Coder tokenizer for templating and masking, and perform soft-conditioning by multiplying the diffusion marginals with the AR embedding matrix; we do not remap token IDs.

[19]

Masking granularity. We replace token-wise masking with non-overlapping block-level masking over the response, with fixed block size $B \in \{2, 4, 8\}$. Hardware and environment. All experiments are run on a single NVIDIA A100 80GB GPU using Python 3.10.19, PyTorch 2.7.0, and CUDA 12.8. We enable bf16 and gradient checkpointing. A.1.1. TRAINING ROLES: AR TRAINED, DIFFUSIO...

[20]

and the sanitized version of the MBPP dataset (Austin et al., 2021b).
• EvalPlus (HE+ & MBPP+): To ensure robustness against weak unit tests, we utilize the augmented test suites from Liu et al. (2023). For MBPP+, we execute the complete testing pipeline to maximize verification depth.
• BigCodeBench: We evaluated on both Full and Hard splits using the offici...

[21]

Interface & Latent Projection: Diffusion output probabilities over the Dream-Coder vocabulary are projected into the AR embedding space by computing the expected embedding relative to the AR embedding matrix. To ensure a clear termination signal, we apply a discretization step at the sequence boundary: if the diffusion model predicts the EOS token, we rep...

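The latent projection described above amounts to a probability-weighted average of the rows of the AR embedding matrix. A minimal pure-Python sketch follows, assuming a toy vocabulary of three tokens and two-dimensional embeddings; the function name `soft_token`, the matrix `E_phi`, and all numbers are illustrative assumptions and are not taken from the paper. The EOS discretization step is not modeled here.

```python
def soft_token(marginal, embedding_matrix):
    """Expected embedding: sum over v of marginal[v] * embedding_matrix[v]."""
    if len(marginal) != len(embedding_matrix):
        raise ValueError("marginal and embedding matrix must cover the same vocabulary")
    dim = len(embedding_matrix[0])
    expected = [0.0] * dim
    for prob, row in zip(marginal, embedding_matrix):
        for d in range(dim):
            expected[d] += prob * row[d]
    return expected


# Hypothetical AR embedding matrix: 3 tokens, 2 dimensions.
E_phi = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Hypothetical diffusion marginal for one masked position.
pi = [0.5, 0.25, 0.25]

print(soft_token(pi, E_phi))               # [1.0, 0.75]
print(soft_token([0.0, 1.0, 0.0], E_phi))  # [0.0, 1.0], a one-hot marginal returns that token's row
```

With a one-hot marginal the soft token reduces to an ordinary (hard) token embedding, which is why hard conditioning can be seen as a special case of this projection.
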
[22]

Decoding Scope: At each denoising iteration, the model evaluates a scope of up to 10 masked blocks. For each candidate block, the AR model decodes a block of $B \in \{2, 4, 8\}$ tokens. As shown in Table 2, increasing the scope results in a drop in accuracy due to premature EOS prediction. Yet, the throughput is maintained within 15%

[23]

Lowest-Entropy Unmasking: We calculate the average per-token entropy provided by the AR executor for each candidate block. The algorithm greedily unmasks the single block with the lowest average entropy, indicating the highest-confidence prediction. This yields a static parallelism of $B$ tokens per iteration. Implementation Details. All experiments are conduc...

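The lowest-entropy selection rule can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: each candidate block is represented as a list of B per-token probability distributions, and the helper names `entropy` and `pick_block` as well as the example numbers are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def pick_block(candidate_blocks):
    """Return (index, averages): the block with the lowest average per-token entropy."""
    averages = [sum(entropy(d) for d in block) / len(block) for block in candidate_blocks]
    best = min(range(len(averages)), key=averages.__getitem__)
    return best, averages


# Two hypothetical candidate blocks with B = 2 and a 3-token vocabulary.
blocks = [
    [[0.4, 0.3, 0.3], [0.5, 0.5, 0.0]],     # uncertain block
    [[0.98, 0.01, 0.01], [1.0, 0.0, 0.0]],  # confident block
]
best, averages = pick_block(blocks)
print(best)  # 1
```

Only the selected block would be unmasked in the current iteration, so the number of tokens committed per iteration equals the block size B.
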
[24]

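Decomposition of the factorization gap. Following Ho et al. (2020); Sohl-Dickstein et al. (2015), the negative ELBO $\mathcal{L}_{\mathrm{NELBO}}$ can be decomposed as follows:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{NELBO}} &= \mathbb{E}_{x_0 \sim q}\big[\mathcal{L}^{x_0}_{\mathrm{NELBO}}\big] = \mathbb{E}_{x_{0:T} \sim q}\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \\
&= H[x_0] + \mathbb{E}_{x_{0:T} \sim q}\left[\log \frac{q(x_{0:T})}{p_\theta(x_{0:T})}\right] \\
&= H[x_0] + \mathbb{E}_{x_{0:T} \sim q}\left[\log \frac{q(x_T)}{p_\theta(x_T)} + \sum_{t=1}^{T} \log \frac{q(x_{t-1} \mid x_t)}{p_\theta(x_{t-1} \mid x_t)}\right] \\
&= H[x_0] + D_{\mathrm{KL}}(q(x_T) \,\|\, p_\theta(x_T \ldots
\end{aligned}
$$
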
[25]

Decrease in NELBO for non-trivial block sizes. We now quantify the reduction in the lower bound $B_B$ relative to the token-level baseline $B_1$. Recall that a block $b_{t-1}^i$ is composed of the subsequence of tokens $[x_{t-1}^{(i-1) \cdot B + 1}, \ldots, x_{t-1}^{i \cdot B}]$. Subtracting the expression for $B_B$ derived in Part 1 from the expression for $B_1$ (where block size is 1), the terms...

[26]

Sufficiency of Soft-Conditioning: If $p^{\mathrm{AR}}_\phi$ is conditioned on the full marginals $\pi$, there exists a parameterization $\phi$ such that $p^{\mathrm{AR}}_\phi(\cdot \mid \pi) = q(\cdot)$

[27]

Fréchet Class Restriction: Let $\pi_{\text{top-}k}$ be the marginals truncated to the $k$ most likely tokens at each position. Conditioning on $\pi_{\text{top-}k}$ restricts the valid solution space to the constrained Fréchet class $\mathcal{F}(\pi_{\text{top-}k})$, strictly limiting the support of any recoverable distribution to the Cartesian product of the top-$k$ sets

[28]

Exclusion of the Global Mode: This restriction introduces an irreducible bias. There exist joint distributions $q$ where the global mode $b^* = \arg\max_b q(b)$ is strictly excluded from the support of the restricted class. Formally:

$$\exists\, q \ \text{such that}\ \forall q' \in \mathcal{F}(\pi_{\text{top-}k}),\quad q'(b^*) = 0 < q(b^*). \tag{6}$$

Thus, high-probability coherent structures can be rendered unrecoverable solel...

[29]

Fréchet Class Restriction. Let $S_k^i = \{t \in \mathcal{V} \mid \mathrm{rank}(t, \pi^i) \le k\}$ be the set of top-$k$ tokens at position $i$. Define the truncated marginal distribution $\pi^i_{\text{top-}k}(t) \propto \pi(t) \cdot \mathbb{1}_{t \in S_k^i}$. Any distribution $q' \in \mathcal{F}(\pi_{\text{top-}k})$ must satisfy the marginal constraints of $\pi_{\text{top-}k}$. Since the probability mass of tokens outside $S_k^i$ is effectively zeroed out in the input, any valid joi...

[30]

Exclusion of the Global Mode. We demonstrate this exclusion via a counter-example. Let $B = 2$, $k = 1$, and $\mathcal{V} = \{\text{Roger}, \text{Houston}, \text{You}, \text{I}, \text{They}\}$. Consider a distribution $q(x^1, x^2)$ with the following probability mass function:
• $q(\text{Roger}, \text{Roger}) = 0.45$ (the coherent mode $b^*$).
• $q(\text{Houston}, \text{You}) = 0.25$.
• $q(\text{Houston}, \text{I}) = 0.25$.
• $q(\text{Houston}, \text{They}) = 0.05$.
The induced mar...

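The counter-example can be checked numerically from the probability mass function given above. The sketch below computes the per-position marginals and the top-1 token at each position; the helper name `marginal` is an illustrative assumption, while the probabilities are the ones listed in the text.

```python
from collections import defaultdict

# Joint distribution q over blocks of size B = 2, as listed in the counter-example.
q = {
    ("Roger", "Roger"): 0.45,
    ("Houston", "You"): 0.25,
    ("Houston", "I"): 0.25,
    ("Houston", "They"): 0.05,
}


def marginal(joint, position):
    """Marginal distribution of a single position of the joint."""
    out = defaultdict(float)
    for block, prob in joint.items():
        out[block[position]] += prob
    return dict(out)


mode = max(q, key=q.get)  # global mode b*
top1 = tuple(max(m, key=m.get) for m in (marginal(q, 0), marginal(q, 1)))

print(mode)              # ('Roger', 'Roger')
print(top1)              # ('Houston', 'Roger')
print(q.get(top1, 0.0))  # 0.0
```

With k = 1 the only block in the Cartesian product of top-1 sets is (Houston, Roger), which has zero probability under q, while the mode (Roger, Roger) is excluded because "Houston" (0.55) outranks "Roger" (0.45) at the first position. The same computation shows why greedy decoding from independent marginals can produce an incoherent pair.
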
[31]

$= \alpha_t\, \delta_{x_t^i, x_0^i} + (1-\alpha_t)\, \delta_{x_t^i, \text{[MASK]}}$, a direct application of Bayes’ theorem results in the closed-form expression $q(x_{t-1}^i \mid x_t^i, x^i$

[32]

$$
= \begin{cases}
1 & \text{if } x_{t-1}^i = x_t^i = x_0^i, \\
\dfrac{1-\alpha_{t-1}}{1-\alpha_t} & \text{if } x_{t-1}^i = x_t^i = \text{[MASK]}, \\
\dfrac{\alpha_{t-1}-\alpha_t}{1-\alpha_t} & \text{if } x_{t-1}^i = x_0^i,\ x_t^i = \text{[MASK]}, \\
0 & \text{otherwise}.
\end{cases}
$$

We introduce the abbreviation $D$ and apply position-wise independence of $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$:

$D := D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) = \mathbb{E}_{q(x_{t-1} \mid x_t, x_0)}$...

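The two masked cases of the closed-form posterior can be sanity-checked by applying Bayes' rule directly. The sketch below assumes the standard absorbing one-step kernel, in which an unmasked token stays unmasked with probability $\alpha_t / \alpha_{t-1}$ and [MASK] is never unmasked by the forward process; that kernel is an assumption on my part, since the source only states the marginal $q(x_t^i \mid x_0^i)$. Function names and the schedule values 0.8 and 0.5 are illustrative.

```python
import math


def posterior_from_bayes(alpha_prev, alpha_t):
    """q(x_{t-1} | x_t = [MASK], x_0) via Bayes' rule.

    Returns (P(x_{t-1} = x_0), P(x_{t-1} = [MASK])).
    """
    prior_clean = alpha_prev                # q(x_{t-1} = x_0 | x_0)
    prior_mask = 1.0 - alpha_prev           # q(x_{t-1} = [MASK] | x_0)
    lik_clean = 1.0 - alpha_t / alpha_prev  # q(x_t = [MASK] | x_{t-1} = x_0), assumed kernel
    lik_mask = 1.0                          # [MASK] is absorbing
    evidence = prior_clean * lik_clean + prior_mask * lik_mask
    return prior_clean * lik_clean / evidence, prior_mask * lik_mask / evidence


def posterior_closed_form(alpha_prev, alpha_t):
    """The same two probabilities from the closed-form expression."""
    return (alpha_prev - alpha_t) / (1.0 - alpha_t), (1.0 - alpha_prev) / (1.0 - alpha_t)


bayes = posterior_from_bayes(0.8, 0.5)
closed = posterior_closed_form(0.8, 0.5)
assert all(math.isclose(a, b) for a, b in zip(bayes, closed))
print(bayes, closed)
```

For this schedule both routes give approximately 0.6 for unmasking to $x_0^i$ and 0.4 for staying masked, up to floating-point rounding.
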
[33]

$= p_\theta(x_{t-1} \mid x_t) = \delta_{x_{t-1}^i, x_0^i}$. Hence, we get

$$D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) = \sum_{i=1}^{L} \delta_{x_t^i = \text{[MASK]}}\, D_{\mathrm{KL}}(q(x_{t-1}^i \mid x_t^i = \text{[MASK]}, x_0^i) \,\|\, p_\theta(x_{t-1}^i \mid x_t)).$$

Now, note that for $x_{t-1}^i = \text{[MASK]}$, the probability $q(x_{t-1}^i \mid x_t^i, x^i$

[34]

does not depend on $x_0^i$, resulting in $p_\theta(x_{t-1}^i \mid x_t) = q(x_{t-1}^i \mid x_t^i, x_0^i)\ \forall x^i$

[35]

Therefore, the only non-vanishing additive term in the KL divergence occurs when $x_{t-1}^i = x_0^i$, i.e., $D = \sum_{i=1}^{L} \delta_{x_t^i = \text{[MASK]}}\, q(x_{t-1}^i = x_0^i \mid x_t^i = \text{[MASK]}, x^i$

[36]

$\log \dfrac{q(x_{t-1}^i = x_0^i \mid x_t^i = \text{[MASK]}, x_0^i)}{p_\theta(x_{t-1}^i = x_0^i \mid x_t)}$. Finally, we compute $p_\theta(x_{t-1}^i = x_0^i \mid x_t)$ provided that $x_t^i = \text{[MASK]}$. First, note that $q(x_{t-1}^i = x_0^i \mid x_t^i = \text{[MASK]}, \tilde{x}^i$

[37]

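$= \dfrac{\alpha_{t-1}-\alpha_t}{1-\alpha_t}\, \delta_{x_0^i, \tilde{x}_0^i} + \dfrac{1-\alpha_{t-1}}{1-\alpha_t}\, \delta_{x_0^i, \text{[MASK]}}$. Then, due to the assumption that $p_\theta(x_0^i = \text{[MASK]} \mid x_t) = 0$, it follows that

$$p_\theta(x_{t-1}^i = x_0^i \mid x_t) = \mathbb{E}_{p_\theta(\tilde{x}_0^i \mid x_t)}\big[q(x_{t-1}^i = x_0^i \mid x_t^i = \text{[MASK]}, \tilde{x}_0^i)\big] = p_\theta(x_0^i \mid x_t)\, q(x_{t-1}^i = x_0^i \mid x_t^i = \text{[MASK]}, x_0^i).$$

Plugging in and canceling equal factors then results in the desired expression $D = \sum_{i=1}^{L} -\delta_{x_t^i = \text{[MASK]}}$...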

[1] [2]

URL https://openreview.net/forum? id=bFJ8Sdr224. Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y ., Huang, Z., Lan, Z., Li, C., Li, C., Li, J., Li, Z., Liu, H., Liu, L., Lu, G., Lu, X., Ma, Y ., Tan, J., Wei, L., Wen, J.-R., Xing, Y ., Zhang, X., Zhao, J., Zheng, D., Zhou, J., Zhou, J., Zhou, Z., Zhu, L., and Zhuang, Y . LLaDA2.0: Scaling Up...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [3]

cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper

URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper. pdf. Campbell, A., Benton, J., and Bortoli, V . D. A Con- tinuous Time Framework for Discrete Denoising Mod- els. InAdvances in Neural Information Processing Systems (NeurIPS), volume 35, 2022. URL https: //openreview.net/pdf?id=DmT862YAieY. Campbell, ...

work page 2020

[3] [4]

Evaluating Large Language Models Trained on Code

URL https://openreview.net/forum? id=ogMTEtHO6M. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brock- man, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar- ian, M., Winter, C., Tillet,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374 2021

[4] [5]

Hersche, M., Moor-Smith, S., Hofmann, T., and Rahimi, A

URL https://openreview.net/forum?id=F1AUXqDLuh. Hersche, M., Moor-Smith, S., Hofmann, T., and Rahimi, A. Soft-Masked Diffusion Language Models. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=Gba02UMvrG. Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models...

work page 2026

[5] [6]

S., Seo, J.-s., Zhang, Z., and Gupta, U

doi: 10.48550/arXiv.2505.21467. URL https://openreview.net/forum?id=KUfKvlX3VY. Huang, F., Tao, T., Zhou, H., Li, L., and Huang, M. On the learning of non-autoregressive transformers. In International Conference on Machine Learning (ICML), volume

work page doi:10.48550/arxiv.2505.21467

[6] [7]

Mercury: Ultra-Fast Language Models Based on Diffusion

PMLR, 2022. URL https://proceedings.mlr.press/v162/huang22k.html. Inception, L., Khanna, S., Kharbanda, S., Li, S., Varma, H., Wang, E., Birnbaum, S., Luo, Z., Miraoui, Y., Palrecha, A., Ermon, S., Grover, A., and Kuleshov, V. Mercury: Ultra-Fast Language Models Based on Diffusion. arXiv preprint arXiv:2506.17298, 2025. doi: 10.48550/arXiv.2506.17298. ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2022

[7] [8]

Z., Kim, H., Kakade, S., and Chen, S

URL https://openreview.net/forum?id=OsZr5T7Cd0. Kim, J., Kim, S., Lee, T., Pan, D. Z., Kim, H., Kakade, S., and Chen, S. Fine-Tuning Masked Diffusion for Provable Self-Correction. arXiv preprint arXiv:2510.01384, October 2025. doi: 10.48550/arXiv.2510.01384. URL http://arxiv.org/abs/2510.01384. Kong, F., Zhang, J., Liu, Y., Wu, Z., Tian, Y., W, V., a...

work page doi:10.48550/arxiv.2510.01384 2025

[8] [9]

Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y., Singh, V., Kautz, J., Zhang, C., and Molchanov, P

URL https://openreview.net/forum?id=1qvx610Cu7. Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y., Singh, V., Kautz, J., Zhang, C., and Molchanov, P. TiDAR: Think in Diffusion, Talk in Autoregression. arXiv preprint arXiv:2511.08923, November 2025b. doi: 10.48550/arXiv.2511.08923. URL http://arxiv.org/abs/2511.08923. Liu, Z., Yang, Y., Zhang, Y., Chen, J...

work page doi:10.48550/arxiv.2506.06295 2024

[9] [10]

URL http://arxiv.org/abs/2510.08369

doi: 10.48550/arXiv.2510.08369. URL http://arxiv.org/abs/2510.08369. Ni, J., Liu, Q., Dou, L., Du, C., Wang, Z., Yan, H., Pang, T., and Shieh, M. Q. Diffusion Language Models are Super Data Learners. arXiv preprint arXiv:2511.03276, November 2025a. doi: 10.48550/arXiv.2511.03276. URL http://arxiv.org/abs/2511.03276. Ni, J., Liu, Q., Du, C., Dou, L., Yan, ...

work page doi:10.48550/arxiv.2510.08369 2025

[10] [11]

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

URL https://openreview.net/forum?id=GDYaNzxt9T. Sahoo, S. S., Arriola, M., and Schiff, Y. Simple and Effective Masked Diffusion Language Models. In Advances in Neural Information Processing Systems (NeurIPS), volume 38, 2024. URL https://openreview.net/forum?id=L4uaAR4ArM. Shi, J., H...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.02193 2024

[11] [12]

cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper

URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Wang, G., Schiff, Y., Sahoo, S. S., and Kuleshov, V. Remasking Discrete Diffusion Models with Inference-Time Scaling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS),

work page 2017

[12] [13]

Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z

URL https://openreview.net/forum?id=IJryQAOy0p. Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z. Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing. In The Fourteenth International Conference on Learning Representations (ICLR),

work page

[13] [14]

K., Garcia Lambas, D., Ruiz, A

URL https://openreview.net/forum?id=t5uLZSRjhF. Wei, Q., Zhang, Y., Liu, Z., Liu, D., and Zhang, L. Accelerating Diffusion Large Language Models with Slow-Fast Sampling: The Three Golden Principles. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. doi: 10.48550/arXiv.2506.10848. URL https://openreview.net/forum? ...

work page doi:10.48550/arxiv.2506 2026

[14] [15]

and Zhang, J

doi: 10.48550/arXiv.2510.00294. URL http://arxiv.org/abs/2510.00294. Xie, Z., Ye, J., Zheng, L., Gao, J., Dong, J., Wu, Z., Zhao, X., Gong, S., Jiang, X., Li, Z., and Kong, L. Dream-Coder 7B: An Open Diffusion Language Model for Code. arXiv preprint arXiv:2509.01142, September 2025. doi: 10.48550/arXiv.2509.01142. URL http://arxiv.org/abs/2509.01142. Ya...

work page doi:10.48550/arxiv.2510.00294 2025

[15] [16]

Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

URL https://aclanthology.org/2025.emnlp-main.1597/. Zhang, S., Peng, F. Z., Zhang, Y., Pan, J., and Chrysos, G. G. Corrective Diffusion Language Models. arXiv preprint arXiv:2512.15596, December 2025b. doi: 10.48550/arXiv.2512.15596. URL http://arxiv.org/abs/2512.15596. Zhong, Y., Gu, Y., Zang, Z., Li, X., Ding, Y., Jia, X., Shen, Y., Lan, Z., Zhu,...

work page internal anchor Pith review doi:10.48550/arxiv.2601.15593 2025

[16] [17]

Training roles. The auxiliary AR model is Qwen3-0.6B (fine-tuned end-to-end), while the DLM (Dream-Coder-Instruct-7B) is kept frozen.

work page

[17] [18]

Tokenizer/interface alignment. We use the Dream-Coder tokenizer for templating and masking, and perform soft-conditioning by multiplying the diffusion marginals with the AR embedding matrix; we do not remap token IDs.

work page

[18] [19]

Hardware and environment. All experiments are run on a single NVIDIA A100 80GB GPU using Python 3.10.19, PyTorch 2.7.0, and CUDA 12.8

Masking granularity. We replace token-wise masking with non-overlapping block-level masking over the response with fixed block size B ∈ {2, 4, 8}. Hardware and environment. All experiments are run on a single NVIDIA A100 80GB GPU using Python 3.10.19, PyTorch 2.7.0, and CUDA 12.8. We enable bf16 and gradient checkpointing. A.1.1. TRAINING ROLES: AR TRAINED, DIFFUSIO...

work page

[19] [20]

• EvalPlus (HE+ & MBPP+): To ensure robustness against weak unit tests, we utilize the augmented test suites from Liu et al

and the sanitized version of the MBPP dataset (Austin et al., 2021b).
• EvalPlus (HE+ & MBPP+): To ensure robustness against weak unit tests, we utilize the augmented test suites from Liu et al. (2023). For MBPP+, we execute the complete testing pipeline to maximize verification depth.
• BigCodeBench: We evaluated on both Full and Hard splits using the offici...

work page 2023

[20] [21]

Interface & Latent Projection: Diffusion output probabilities over the Dream-Coder vocabulary are projected into the AR embedding space by computing the expected embedding relative to the AR embedding matrix. To ensure a clear termination signal, we apply a discretization step at the sequence boundary: if the diffusion model predicts the EOS token, we rep...
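The latent projection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the diffusion marginals and the AR embedding matrix index the same vocabulary (consistent with the statement that token IDs are not remapped), and the names `soft_condition`, `marginals`, and `ar_embedding`, as well as the toy sizes, are hypothetical.

```python
import numpy as np

def soft_condition(marginals: np.ndarray, ar_embedding: np.ndarray) -> np.ndarray:
    """Expected AR embedding at each position.

    marginals:    (L, V) array, each row a probability distribution over the vocabulary.
    ar_embedding: (V, d) AR embedding matrix (assumed to share the vocabulary indexing).
    Returns an (L, d) array of soft input embeddings.
    """
    return marginals @ ar_embedding

# Toy example with hypothetical sizes: 4 positions, 6 vocabulary entries, 3 dimensions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6))
marginals = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
ar_embedding = rng.normal(size=(6, 3))

soft_inputs = soft_condition(marginals, ar_embedding)

# A one-hot marginal recovers the ordinary (hard) embedding lookup.
one_hot = np.eye(6)[[2]]
hard_lookup = soft_condition(one_hot, ar_embedding)
```

With a one-hot row the product reduces to a plain embedding lookup, so hard and soft conditioning share the same interface.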

work page

[21] [22]

For each candidate block, the AR model decodes a block of size B ∈ {2, 4, 8} tokens

Decoding Scope: At each denoising iteration, the model evaluates a scope of up to 10 masked blocks. For each candidate block, the AR model decodes a block of size B ∈ {2, 4, 8} tokens. As shown in Table 2, increasing the scope results in a drop in accuracy due to premature EOS prediction. Yet, the throughput is maintained within 15%

work page

[22] [23]

The algorithm greedily unmasks the single block with the lowest average entropy, indicating the highest-confidence prediction

Lowest-Entropy Unmasking: We calculate the average per-token entropy provided by the AR executor for each candidate block. The algorithm greedily unmasks the single block with the lowest average entropy, indicating the highest-confidence prediction. This yields a static parallelism of B tokens per iteration. Implementation Details. All experiments are conduc...
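The selection rule above can be illustrated with a short sketch. It assumes each candidate block comes with a (B, V) array of per-token AR distributions; the function names and the toy candidates are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def block_entropy(probs: np.ndarray) -> float:
    """Average per-token Shannon entropy (in nats) of one block; probs has shape (B, V)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0) for zero-probability tokens
    return float(-(p * np.log(p)).sum(axis=1).mean())

def pick_block(candidates: dict) -> int:
    """Return the index of the candidate block with the lowest average entropy."""
    return min(candidates, key=lambda idx: block_entropy(candidates[idx]))

# Three hypothetical candidate blocks (B = 2 tokens, V = 3 vocabulary entries).
candidates = {
    0: np.array([[0.50, 0.50, 0.00], [0.40, 0.30, 0.30]]),
    3: np.array([[0.98, 0.01, 0.01], [0.90, 0.05, 0.05]]),
    7: np.array([[1 / 3, 1 / 3, 1 / 3], [0.60, 0.20, 0.20]]),
}

chosen = pick_block(candidates)  # block 3 has the most peaked distributions
```

Only the chosen block is unmasked in an iteration, which is why the parallelism stays fixed at B tokens per step.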

work page

[23] [24]

(2020); Sohl-Dickstein et al

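Decomposition of the factorization gap. Following Ho et al. (2020); Sohl-Dickstein et al. (2015), the negative ELBO $\mathcal{L}_{\mathrm{NELBO}}$ can be decomposed as follows:
$$\mathcal{L}_{\mathrm{NELBO}} = \mathbb{E}_{\mathbf{x}_0 \sim q}\big[\mathcal{L}^{\mathbf{x}_0}_{\mathrm{NELBO}}\big] = \mathbb{E}_{\mathbf{x}_{0:T} \sim q}\left[\log \frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})}\right] = H[\mathbf{x}_0] + \mathbb{E}_{\mathbf{x}_{0:T} \sim q}\left[\log \frac{q(\mathbf{x}_{0:T})}{p_\theta(\mathbf{x}_{0:T})}\right] = H[\mathbf{x}_0] + \mathbb{E}_{\mathbf{x}_{0:T} \sim q}\left[\log \frac{q(\mathbf{x}_T)}{p_\theta(\mathbf{x}_T)} + \sum_{t=1}^{T} \log \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}\right]$$
$= H[\mathbf{x}_0] + D_{\mathrm{KL}}(q(\mathbf{x}_T) \,\|\, p_\theta(\mathbf{x}_T$...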

work page 2020

[24] [25]

Recall that a block bi t−1 is composed of the subsequence of tokens [x(i−1)·B+1 t−1 ,

Decrease in NELBO for non-trivial block sizes. We now quantify the reduction in the lower bound $\mathcal{B}_B$ relative to the token-level baseline $\mathcal{B}_1$. Recall that a block $b^i_{t-1}$ is composed of the subsequence of tokens $[x^{(i-1) \cdot B + 1}_{t-1}, \ldots, x^{i \cdot B}_{t-1}]$. Subtracting the expression for $\mathcal{B}_B$ derived in Part 1 from the expression for $\mathcal{B}_1$ (where block size is 1), the terms...

work page

[25] [26]

Sufficiency of Soft-Conditioning: If $p^{\mathrm{AR}}_\phi$ is conditioned on the full marginals $\pi$, there exists a parameterization $\phi$ such that $p^{\mathrm{AR}}_\phi(\cdot \mid \pi) = q(\cdot)$.

work page

[26] [27]

Fréchet Class Restriction: Let $\pi_{\text{top-}k}$ be the marginals truncated to the $k$ most likely tokens at each position. Conditioning on $\pi_{\text{top-}k}$ restricts the valid solution space to the constrained Fréchet class $\mathcal{F}(\pi_{\text{top-}k})$, strictly limiting the support of any recoverable distribution to the Cartesian product of the top-$k$ sets.

work page

[27] [28]

There exist joint distributions $q$ where the global mode $b^* = \arg\max_b q(b)$ is strictly excluded from the support of the restricted class

Exclusion of the Global Mode: This restriction introduces an irreducible bias. There exist joint distributions $q$ where the global mode $b^* = \arg\max_b q(b)$ is strictly excluded from the support of the restricted class. Formally:
$$\exists\, q \ \text{such that} \ \forall q' \in \mathcal{F}(\pi_{\text{top-}k}), \quad q'(b^*) = 0 < q(b^*). \qquad (6)$$
Thus, high-probability coherent structures can be rendered unrecoverable solel...

work page

[28] [29]

Define the truncated marginal distribution $\pi^i_{\text{top-}k}(t) \propto \pi^i(t) \cdot \mathbb{1}_{t \in S^i_k}$. Any distribution $q' \in \mathcal{F}(\pi_{\text{top-}k})$ must satisfy the marginal constraints of $\pi_{\text{top-}k}$

Fréchet Class Restriction. Let $S^i_k = \{t \in \mathcal{V} \mid \mathrm{rank}(t, \pi^i) \le k\}$ be the set of top-$k$ tokens at position $i$. Define the truncated marginal distribution $\pi^i_{\text{top-}k}(t) \propto \pi^i(t) \cdot \mathbb{1}_{t \in S^i_k}$. Any distribution $q' \in \mathcal{F}(\pi_{\text{top-}k})$ must satisfy the marginal constraints of $\pi_{\text{top-}k}$. Since the probability mass of tokens outside $S^i_k$ is effectively zeroed out in the input, any valid joi...

work page

[29] [30]

Let $B = 2$, $k = 1$, and $\mathcal{V} = \{\text{Roger}, \text{Houston}, \text{You}, \text{I}, \text{They}\}$

Exclusion of the Global Mode. We demonstrate this exclusion via a counter-example. Let $B = 2$, $k = 1$, and $\mathcal{V} = \{\text{Roger}, \text{Houston}, \text{You}, \text{I}, \text{They}\}$. Consider a distribution $q(x^1, x^2)$ with the following probability mass function:
• $q(\text{Roger}, \text{Roger}) = 0.45$ (the coherent mode $b^*$).
• $q(\text{Houston}, \text{You}) = 0.25$.
• $q(\text{Houston}, \text{I}) = 0.25$.
• $q(\text{Houston}, \text{They}) = 0.05$.
The induced mar...
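The counter-example can be checked numerically. The sketch below uses only the probability mass function listed above; the helper names (`marginal`, `top_k`) are hypothetical. It computes the per-position marginals, truncates them to the top-1 token, and tests whether the global mode lies in the Cartesian product of the truncated supports.

```python
from collections import defaultdict
from itertools import product

# Joint distribution over blocks of size B = 2, as given in the counter-example.
q = {
    ("Roger", "Roger"): 0.45,
    ("Houston", "You"): 0.25,
    ("Houston", "I"): 0.25,
    ("Houston", "They"): 0.05,
}

def marginal(joint: dict, pos: int) -> dict:
    """Marginal distribution of the token at position `pos`."""
    m = defaultdict(float)
    for block, prob in joint.items():
        m[block[pos]] += prob
    return dict(m)

def top_k(m: dict, k: int) -> set:
    """Set of the k most likely tokens under marginal m."""
    return set(sorted(m, key=m.get, reverse=True)[:k])

k = 1
supports = [top_k(marginal(q, pos), k) for pos in range(2)]
restricted_support = set(product(*supports))  # Cartesian product of the top-k sets
mode = max(q, key=q.get)                      # global mode b* of the joint

mode_is_recoverable = mode in restricted_support
```

Position 1 has marginal mass 0.55 on Houston versus 0.45 on Roger, so top-1 truncation keeps only Houston, while position 2 keeps only Roger. The restricted support is therefore the single block (Houston, Roger), which has zero probability under $q$, and the mode (Roger, Roger) is excluded.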

work page 2024

[30] [31]

$= \alpha_t\, \delta_{x^i_t, x^i_0} + (1 - \alpha_t)\, \delta_{x^i_t, \texttt{[MASK]}}$, a direct application of Bayes' theorem results in the closed-form expression $q(x^i_{t-1} \mid x^i_t, x^i$

work page

[31] [32]

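$$= \begin{cases} 1 & \text{if } x^i_{t-1} = x^i_t = x^i_0, \\ \dfrac{1 - \alpha_{t-1}}{1 - \alpha_t} & \text{if } x^i_{t-1} = x^i_t = \texttt{[MASK]}, \\ \dfrac{\alpha_{t-1} - \alpha_t}{1 - \alpha_t} & \text{if } x^i_{t-1} = x^i_0,\ x^i_t = \texttt{[MASK]}, \\ 0 & \text{otherwise}. \end{cases}$$
We introduce the abbreviation $D$ and apply position-wise independence of $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$: $D := D_{\mathrm{KL}}(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)) = \mathbb{E}_{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}$...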
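As a sanity check, the closed-form posterior can be compared against a direct application of Bayes' theorem. The sketch below is illustrative only: the forward marginal follows the expression $\alpha_t \delta_{x_t, x_0} + (1 - \alpha_t) \delta_{x_t, \texttt{[MASK]}}$, while the one-step transition (keep a token with probability $\alpha_t / \alpha_{t-1}$, otherwise mask it) is the standard absorbing-state choice and is an assumption here, as are the function names and the toy schedule values.

```python
import math

MASK = "[MASK]"

def forward_marginal(x_t, x_0, alpha_t):
    """q(x_t | x_0) = alpha_t * [x_t == x_0] + (1 - alpha_t) * [x_t == MASK]."""
    return alpha_t * (x_t == x_0) + (1 - alpha_t) * (x_t == MASK)

def one_step(x_t, x_prev, alpha_prev, alpha_t):
    """Assumed absorbing transition q(x_t | x_{t-1}); a masked token stays masked."""
    if x_prev == MASK:
        return float(x_t == MASK)
    keep = alpha_t / alpha_prev
    return keep * (x_t == x_prev) + (1 - keep) * (x_t == MASK)

def posterior_bayes(x_prev, x_t, x_0, alpha_prev, alpha_t):
    """q(x_{t-1} | x_t, x_0) via Bayes' theorem."""
    numerator = one_step(x_t, x_prev, alpha_prev, alpha_t) * forward_marginal(x_prev, x_0, alpha_prev)
    return numerator / forward_marginal(x_t, x_0, alpha_t)

def posterior_closed_form(x_prev, x_t, x_0, alpha_prev, alpha_t):
    """The four-case closed-form expression."""
    if x_prev == x_t == x_0:
        return 1.0
    if x_prev == x_t == MASK:
        return (1 - alpha_prev) / (1 - alpha_t)
    if x_prev == x_0 and x_t == MASK:
        return (alpha_prev - alpha_t) / (1 - alpha_t)
    return 0.0

# Hypothetical schedule values with alpha_{t-1} > alpha_t, and a clean token "a".
alpha_prev, alpha_t, x_0 = 0.7, 0.4, "a"
max_gap = max(
    abs(posterior_bayes(xp, xt, x_0, alpha_prev, alpha_t)
        - posterior_closed_form(xp, xt, x_0, alpha_prev, alpha_t))
    for xt in ("a", MASK)          # states reachable from x_0 at step t
    for xp in ("a", "b", MASK)     # candidate states at step t-1
)
```

Under these assumptions the two computations agree up to floating-point error, and for a masked $x_t$ the two non-zero cases sum to one.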

work page

[32] [33]

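Hence, we get $D_{\mathrm{KL}}(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)) = \sum_{i=1}^{L} \delta_{x^i_t = \texttt{[MASK]}}\, D_{\mathrm{KL}}(q(x^i_{t-1} \mid x^i_t = \texttt{[MASK]}, x^i_0) \,\|\, p_\theta(x^i_{t-1} \mid \mathbf{x}_t))$

$= p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \delta_{x^i_{t-1}, x^i_0}$. Hence, we get $D_{\mathrm{KL}}(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)) = \sum_{i=1}^{L} \delta_{x^i_t = \texttt{[MASK]}}\, D_{\mathrm{KL}}(q(x^i_{t-1} \mid x^i_t = \texttt{[MASK]}, x^i_0) \,\|\, p_\theta(x^i_{t-1} \mid \mathbf{x}_t))$. Now, note that for $x^i_{t-1} = \texttt{[MASK]}$, the probability $q(x^i_{t-1} \mid x^i_t, x^i$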

work page

[33] [34]

does not depend on $x^i_0$, resulting in $p_\theta(x^i_{t-1} \mid \mathbf{x}_t) = q(x^i_{t-1} \mid x^i_t, x^i_0)\ \forall x^i$

work page

[34] [35]

Therefore, the only non-vanishing additive term in the KL divergence occurs when $x^i_{t-1} = x^i_0$, i.e., $D = \sum_{i=1}^{L} \delta_{x^i_t = \texttt{[MASK]}}\, q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, x^i$

work page

[35] [36]

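Finally, we compute $p_\theta(x^i_{t-1} = x^i_0 \mid \mathbf{x}_t)$ provided that $x^i_t = \texttt{[MASK]}$

$\log \dfrac{q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, x^i_0)}{p_\theta(x^i_{t-1} = x^i_0 \mid \mathbf{x}_t)}$. Finally, we compute $p_\theta(x^i_{t-1} = x^i_0 \mid \mathbf{x}_t)$ provided that $x^i_t = \texttt{[MASK]}$. First, note that $q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, \tilde{x}^i$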

work page

[36] [37]

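$= \dfrac{\alpha_{t-1} - \alpha_t}{1 - \alpha_t}\, \delta_{x^i_0, \tilde{x}^i_0} + \dfrac{1 - \alpha_{t-1}}{1 - \alpha_t}\, \delta_{x^i_0, \texttt{[MASK]}}$. Then, due to the assumption that $p_\theta(x^i_0 = \texttt{[MASK]} \mid \mathbf{x}_t) = 0$, it follows that $p_\theta(x^i_{t-1} = x^i_0 \mid \mathbf{x}_t) = \mathbb{E}_{p_\theta(\tilde{x}^i_0 \mid \mathbf{x}_t)}\big[q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, \tilde{x}^i_0)\big] = p_\theta(x^i_0 \mid \mathbf{x}_t)\, q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, x^i_0)$. Plugging in and canceling equal factors then results in the desired expression $D = \sum_{i=1}^{L} -\delta_{x^i_t = \texttt{[MASK]}}$...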

work page