arxiv: 2603.20216 · v2 · pith:OFIGOSNYnew · submitted 2026-03-03 · 💻 cs.CL · cs.AI · cs.LG

Locally Coherent Parallel Decoding in Diffusion Language Models

Pith reviewed 2026-05-21 11:59 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords diffusion language models · parallel decoding · autoregressive models · local coherence · code generation · hybrid decoding

The pith

CoDiLA uses a compact auxiliary autoregressive model to ensure local coherence during parallel token sampling in diffusion language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion language models generate multiple tokens at once to achieve sub-linear latency and support bidirectional context, yet independent sampling from marginal distributions produces syntactic inconsistencies and broken multi-token structures. The work introduces CoDiLA to reconcile parallel sampling with local dependency modeling by delegating fine-grained decoding to a small auxiliary autoregressive model that acts on the diffusion latents inside each block. The main diffusion model retains its bidirectional strength across blocks, while the auxiliary component enforces sequential validity locally. Experiments on code generation show that an auxiliary model of only 0.6 billion parameters removes coherence artifacts and improves the accuracy-speed trade-off.
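
To make the failure mode concrete, the sketch below compares per-position top-1 decoding from the marginals with the mode of the joint distribution. It reuses the two-token toy distribution from the counter-example in the paper's appendix as quoted further down this page. The function names are illustrative and do not come from the paper.

```python
from collections import defaultdict

# Toy joint distribution over a block of two tokens (B = 2), copied from the
# appendix counter-example quoted on this page.
JOINT = {
    ("Roger", "Roger"): 0.45,
    ("Houston", "You"): 0.25,
    ("Houston", "I"): 0.25,
    ("Houston", "They"): 0.05,
}


def marginals(joint):
    """Return one {token: probability} dict per position."""
    n_positions = len(next(iter(joint)))
    per_position = [defaultdict(float) for _ in range(n_positions)]
    for tokens, prob in joint.items():
        for pos, tok in enumerate(tokens):
            per_position[pos][tok] += prob
    return [dict(m) for m in per_position]


def independent_top1(joint):
    """Pick the most likely token at each position, ignoring the other positions."""
    return tuple(max(m, key=m.get) for m in marginals(joint))


def joint_mode(joint):
    """Pick the most likely block under the joint distribution."""
    return max(joint, key=joint.get)


greedy = independent_top1(JOINT)
mode = joint_mode(JOINT)
print("independent top-1:", greedy, "joint probability:", JOINT.get(greedy, 0.0))
print("joint mode:       ", mode, "joint probability:", JOINT[mode])
```

In this toy case the per-position choice combines the most likely first token with the most likely second token into a pair that has zero probability under the joint distribution, which is the kind of incoherence the summary above attributes to independent sampling.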

Core claim

Rather than forcing the diffusion language model to resolve fine-grained syntax on its own, CoDiLA delegates local decoding to a small auxiliary autoregressive model operating on the diffusion latents. This design reconciles parallel sampling with local dependency modeling, allowing sequential validity within a block while preserving the core bidirectional modeling capabilities of the main model across blocks.

What carries the argument

CoDiLA, which integrates a compact auxiliary autoregressive model on diffusion latents to enforce local sequential validity during parallel block generation.

Load-bearing premise

That a small auxiliary autoregressive model can capture the fine-grained syntax and multi-token structures needed for local coherence.

What would settle it

Run the method on code-generation prompts and measure whether the rate of syntactic errors or broken multi-token structures drops below the rate observed in standard independent diffusion sampling.
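
One rough way to run such a check for Python outputs is sketched below. It treats "fails to parse" as a proxy for a syntactic error, which is an assumption of this sketch rather than the paper's metric, and it will miss broken multi-token structures that still parse. The sample strings are hand-written placeholders, not model outputs.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the snippet parses as Python source."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def syntax_error_rate(samples):
    """Fraction of samples that fail to parse."""
    if not samples:
        raise ValueError("need at least one sample")
    failures = sum(not is_valid_python(s) for s in samples)
    return failures / len(samples)


# Hand-written placeholder generations (hypothetical, for illustration only).
baseline_outputs = [
    "def f(x):\n    return x + 1\n",
    "def g(x:\n    return x\n",  # unbalanced parenthesis
]
candidate_outputs = [
    "def f(x):\n    return x + 1\n",
    "def g(x):\n    return x\n",
]

print("baseline error rate: ", syntax_error_rate(baseline_outputs))
print("candidate error rate:", syntax_error_rate(candidate_outputs))
```

Applied to real generations from both decoders on the same prompts, the two rates give the comparison this section asks for.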

Figures

Figures reproduced from arXiv: 2603.20216 by Abbas Rahimi, Michael Hersche, Nicolas Menet, Ronan Tanios.

Figure 1. Our CoDiLA in action. a) An example of incoherent text generated by Dream-Coder-Instruct-7B in the first iteration. Due to independent modeling of marginal distributions, it predicts the incoherent token “problem” (Top-1). b) This work enforces local coherence using a block-wise AR model conditioned on soft local tokens. In this example, it recovers coherence by retrieving the correct token “(list” from th…
Figure 2. CoDiLA with a block size of B = 4. This example depicts the prediction of the first block ($b^1$). First, the DLM computes the token-wise conditional marginal probability vectors ($\pi^j_t$). Next, we perform soft-conditioning by computing the expected embedding ($e^j_t$) over the AR model’s embedding matrix ($E_\phi$), weighted by these marginals. Finally, the AR model receives these soft tokens, encapsulated by <t…
Figure 3. Larger block sizes (B) reduce the training loss. We compute the average perplexity weighted by the masking ratio (see Equation (2)), and display the moving average over 10 samples. The forward process always masks blocks of 8 contiguous tokens.
Adjacent body text: Ling Team, 2025), the same SFT dataset that was used for the DLM. We finetune a separate AR model for each block size for 32k steps, while keeping the DLM frozen. W…
Figure 4. Inference with static parallelism. We report on Pass@1 (%) vs. Throughput (tokens/sec, batch-size 1) on a single NVIDIA A100-80GB GPU. We compare the base DLM (Xie et al., 2025), ADJUST (Bansal & Sanghavi, 2025), and our CoDiLA, all built on Dream-Coder-Instruct-7B. Parallelism is controlled by unmasking a fixed number of tokens per iteration. CoDiLA consistently achieves higher accuracy at equivalent thro…
Figure 5. Inference with dynamic parallelism. We operate a dynamic CoDiLA (B = 4) with different entropy thresholds ($\tau$).
Adjacent body text: better accuracy-throughput behavior than a small block-size (B = 2) with static sampling. 4.5. Ablation: Soft vs. Hard Conditioning. We ablate the effectiveness of our soft-conditioning by comparing against hard-conditioning. We train a variant of the AR model that conditions only on the hard top…
Original abstract

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel generation while ensuring sequential validity within a block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.
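
The Figure 2 caption describes the interface between the two models as soft-conditioning: each position's marginal probability vector is used to weight the rows of the AR model's embedding matrix, giving an expected embedding. The pure-Python sketch below illustrates that computation next to a hard top-1 alternative. It assumes the marginals and the embedding matrix are indexed by the same vocabulary IDs. The function names and toy numbers are hypothetical, and a real implementation would use a single matrix product on tensors.

```python
def soft_tokens(marginals, embedding):
    """Expected embedding per position.

    marginals: list of B probability vectors, each of length V
    embedding: list of V embedding vectors, each of length d
    returns:   list of B vectors of length d
    """
    d = len(embedding[0])
    return [
        [sum(p * embedding[v][k] for v, p in enumerate(pi)) for k in range(d)]
        for pi in marginals
    ]


def hard_tokens(marginals, embedding):
    """Hard alternative: embed only the top-1 token at each position."""
    return [
        list(embedding[max(range(len(pi)), key=pi.__getitem__)])
        for pi in marginals
    ]


# Toy sizes (illustrative): vocabulary V = 3, embedding width d = 2, block B = 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pi = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

print("soft:", soft_tokens(pi, E))
print("hard:", hard_tokens(pi, E))
```

When a marginal is one-hot the two variants coincide. They differ only where the diffusion model is uncertain, which is where the soft variant passes more information to the AR model.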

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoDiLA, a hybrid method for diffusion language models that delegates local token dependencies and syntactic coherence to a compact auxiliary autoregressive model (e.g., 0.6B parameters) operating on diffusion latents. This enables parallel block-wise sampling while aiming to preserve the main DLM's bidirectional context modeling across blocks, with the central empirical claim being that the approach eliminates coherence artifacts and achieves a new accuracy-speed Pareto frontier on code generation tasks.

Significance. If the separation of local AR correction from global bidirectional diffusion modeling holds without side effects on receptive fields, the result would be a pragmatic advance for making DLMs competitive in latency-sensitive applications such as code generation and editing. The use of a highly compact auxiliary model is a strength that could generalize beyond the reported benchmarks.

major comments (2)
  1. [§3] §3 (Method, CoDiLA description): the assertion that delegating local decoding to the auxiliary AR model 'maintains core DLM capabilities, including bidirectional modeling across blocks' lacks any derivation, attention-map analysis, or ablation demonstrating that sequential decisions inside a block do not propagate constraints that shrink the effective cross-block receptive field of the diffusion backbone.
  2. [§4.2] §4.2 (Experiments, Pareto-frontier results): the claim that the 0.6B auxiliary model 'effectively eliminates coherence artifacts' and establishes a new frontier is presented without reported error bars, full baseline tables, or controls that isolate latent-update effects from the auxiliary model versus the unmodified DLM; this is load-bearing for the central empirical contribution.
minor comments (2)
  1. [Abstract] Abstract: the specific code-generation benchmarks (e.g., HumanEval, MBPP) supporting the Pareto-frontier claim should be named explicitly.
  2. Notation: the precise interface between diffusion latents and the auxiliary AR conditioning (update vs. read-only) would benefit from a small diagram or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our presentation. Below we address each major comment in turn.

  1. Referee: [§3] §3 (Method, CoDiLA description): the assertion that delegating local decoding to the auxiliary AR model 'maintains core DLM capabilities, including bidirectional modeling across blocks' lacks any derivation, attention-map analysis, or ablation demonstrating that sequential decisions inside a block do not propagate constraints that shrink the effective cross-block receptive field of the diffusion backbone.

    Authors: We agree that the original manuscript would benefit from a more explicit justification of this architectural property. The method description relies on the separation of concerns—the auxiliary AR model is applied only to intra-block latents after the diffusion backbone has produced them, while cross-block conditioning remains the responsibility of the DLM—but we did not supply a formal argument or supporting analysis. In the revised version we add a short derivation in §3.2 showing that intra-block sequential decisions cannot retroactively alter the DLM’s attention over prior blocks, together with an ablation that measures cross-block attention entropy and a gradient-based receptive-field probe confirming no measurable shrinkage. revision: yes

  2. Referee: [§4.2] §4.2 (Experiments, Pareto-frontier results): the claim that the 0.6B auxiliary model 'effectively eliminates coherence artifacts' and establishes a new frontier is presented without reported error bars, full baseline tables, or controls that isolate latent-update effects from the auxiliary model versus the unmodified DLM; this is load-bearing for the central empirical contribution.

    Authors: We concur that statistical reporting and isolation of the auxiliary model’s contribution are necessary to support the central empirical claim. The submitted version contained only single-run point estimates. In the revision we have repeated all main experiments across five random seeds, added error bars to the Pareto plots and tables, expanded the baseline comparison table, and introduced a control condition that performs identical latent updates without the auxiliary AR head. The new results show that the coherence improvement is attributable to the auxiliary model rather than the latent-update procedure alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method validated externally

The paper presents CoDiLA as an architectural design choice that delegates local syntax to a compact auxiliary AR model on diffusion latents while claiming to preserve the main DLM's bidirectional capabilities across blocks. All central assertions—elimination of coherence artifacts, parallel generation with sequential validity, and new Pareto frontier on code benchmarks—are framed as empirical demonstrations rather than mathematical derivations or predictions. No equations, fitted parameters renamed as outputs, or self-citation chains appear in the provided text that would reduce the claimed improvements to tautological inputs by construction. The result is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on the design choice of a compact auxiliary model whose size and training are not derived from first principles but selected to balance coherence and speed.

free parameters (1)
  • auxiliary AR model size
    Explicitly given as 0.6B parameters in the abstract; this is a hand-chosen design parameter to keep overhead low.

pith-pipeline@v0.9.0 · 5730 in / 1168 out tokens · 37848 ms · 2026-05-21T11:59:14.199592+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

  1. [2]

    URL https://openreview.net/forum?id=bFJ8Sdr224. Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y., Huang, Z., Lan, Z., Li, C., Li, C., Li, J., Li, Z., Liu, H., Liu, L., Lu, G., Lu, X., Ma, Y., Tan, J., Wei, L., Wen, J.-R., Xing, Y., Zhang, X., Zhao, J., Zheng, D., Zhou, J., Zhou, J., Zhou, Z., Zhu, L., and Zhuang, Y. LLaDA2.0: Scaling Up...

  2. [3]

    cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper

    URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. Campbell, A., Benton, J., and Bortoli, V. D. A Continuous Time Framework for Discrete Denoising Models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022. URL https://openreview.net/pdf?id=DmT862YAieY. Campbell, ...

  3. [4]

    Evaluating Large Language Models Trained on Code

    URL https://openreview.net/forum?id=ogMTEtHO6M. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet,...

  4. [5]

    Hersche, M., Moor-Smith, S., Hofmann, T., and Rahimi, A

    URL https://openreview.net/forum?id=F1AUXqDLuh. Hersche, M., Moor-Smith, S., Hofmann, T., and Rahimi, A. Soft-Masked Diffusion Language Models. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=Gba02UMvrG. Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models...

  5. [6]

    S., Seo, J.-s., Zhang, Z., and Gupta, U

    doi: 10.48550/arXiv.2505.21467. URL https://openreview.net/forum?id=KUfKvlX3VY. Huang, F., Tao, T., Zhou, H., Li, L., and Huang, M. On the learning of non-autoregressive transformers. In International Conference on Machine Learning (ICML), volume

  6. [7]

    Mercury: Ultra-Fast Language Models Based on Diffusion

    PMLR, 2022. URL https://proceedings.mlr.press/v162/huang22k.html. Inception, L., Khanna, S., Kharbanda, S., Li, S., Varma, H., Wang, E., Birnbaum, S., Luo, Z., Miraoui, Y., Palrecha, A., Ermon, S., Grover, A., and Kuleshov, V. Mercury: Ultra-Fast Language Models Based on Diffusion. arXiv preprint arXiv:2506.17298, 2025. doi: 10.48550/arXiv.2506.17298. ...

  7. [8]

    Z., Kim, H., Kakade, S., and Chen, S

    URL https://openreview.net/forum?id=OsZr5T7Cd0. Kim, J., Kim, S., Lee, T., Pan, D. Z., Kim, H., Kakade, S., and Chen, S. Fine-Tuning Masked Diffusion for Provable Self-Correction. arXiv preprint arXiv:2510.01384, October 2025. doi: 10.48550/arXiv.2510.01384. URL http://arxiv.org/abs/2510.01384. Kong, F., Zhang, J., Liu, Y., Wu, Z., Tian, Y., W, V., a...

  8. [9]

    Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y., Singh, V., Kautz, J., Zhang, C., and Molchanov, P

    URL https://openreview.net/forum?id=1qvx610Cu7. Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y., Singh, V., Kautz, J., Zhang, C., and Molchanov, P. TiDAR: Think in Diffusion, Talk in Autoregression. arXiv preprint arXiv:2511.08923, November 2025b. doi: 10.48550/arXiv.2511.08923. URL http://arxiv.org/abs/2511.08923. Liu, Z., Yang, Y., Zhang, Y., Chen, J...

  9. [10]

    URL http://arxiv.org/abs/2510.08369

    doi: 10.48550/arXiv.2510.08369. URL http://arxiv.org/abs/2510.08369. Ni, J., Liu, Q., Dou, L., Du, C., Wang, Z., Yan, H., Pang, T., and Shieh, M. Q. Diffusion Language Models are Super Data Learners. arXiv preprint arXiv:2511.03276, November 2025a. doi: 10.48550/arXiv.2511.03276. URL http://arxiv.org/abs/2511.03276. Ni, J., Liu, Q., Du, C., Dou, L., Yan, ...

  10. [11]

    Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

    URL https://openreview.net/forum?id=GDYaNzxt9T. Sahoo, S. S., Arriola, M., and Schiff, Y. Simple and Effective Masked Diffusion Language Models. In Advances in Neural Information Processing Systems (NeurIPS), volume 38, 2024. URL https://openreview.net/forum?id=L4uaAR4ArM. Shi, J., H...

  11. [12]

    cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper

    URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Wang, G., Schiff, Y., Sahoo, S. S., and Kuleshov, V. Remasking Discrete Diffusion Models with Inference-Time Scaling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS),

  12. [13]

    Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z

    URL https://openreview.net/forum?id=IJryQAOy0p. Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z. Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing. In The Fourteenth International Conference on Learning Representations (ICLR),

  13. [14]

    K., Garcia Lambas, D., Ruiz, A

    URL https://openreview.net/forum?id=t5uLZSRjhF. Wei, Q., Zhang, Y., Liu, Z., Liu, D., and Zhang, L. Accelerating Diffusion Large Language Models with Slow-Fast Sampling: The Three Golden Principles. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. doi: 10.48550/arXiv.2506.10848. URL https://openreview.net/forum? ...

  14. [15]

    and Zhang, J

    doi: 10.48550/arXiv.2510.00294. URL http://arxiv.org/abs/2510.00294. Xie, Z., Ye, J., Zheng, L., Gao, J., Dong, J., Wu, Z., Zhao, X., Gong, S., Jiang, X., Li, Z., and Kong, L. Dream-Coder 7B: An Open Diffusion Language Model for Code. arXiv preprint arXiv:2509.01142, September 2025. doi: 10.48550/arXiv.2509.01142. URL http://arxiv.org/abs/2509.01142. Ya...

  15. [16]

    Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

    URL https://aclanthology.org/2025.emnlp-main.1597/. Zhang, S., Peng, F. Z., Zhang, Y., Pan, J., and Chrysos, G. G. Corrective Diffusion Language Models. arXiv preprint arXiv:2512.15596, December 2025b. doi: 10.48550/arXiv.2512.15596. URL http://arxiv.org/abs/2512.15596. Zhong, Y., Gu, Y., Zang, Z., Li, X., Ding, Y., Jia, X., Shen, Y., Lan, Z., Zhu,...

  16. [17]

    Training roles. The auxiliary AR model is Qwen3-0.6B (finetuned end-to-end), while the DLM model (Dream-Coder-Instruct-7B) is kept frozen

  17. [18]

    Tokenizer/interface alignment. We use the Dream-Coder tokenizer for templating and masking, and perform soft-conditioning by multiplying the diffusion marginals with the AR embedding matrix; we do not remap token IDs

  18. [19]

    Hardware and environment. All experiments are run on a single NVIDIA A100 80GB GPU using Python 3.10.19, PyTorch 2.7.0, and CUDA 12.8

    Masking granularity. We replace token-wise masking with non-overlapping block-level masking over the response with fixed block size B ∈ {2, 4, 8}. Hardware and environment. All experiments are run on a single NVIDIA A100 80GB GPU using Python 3.10.19, PyTorch 2.7.0, and CUDA 12.8. We enable bf16 and gradient checkpointing. A.1.1. TRAINING ROLES: AR TRAINED, DIFFUSIO...

  19. [20]

    • EvalPlus (HE+ & MBPP+): To ensure robustness against weak unit tests, we utilize the augmented test suites from Liu et al

    and the sanitized version of the MBPP dataset (Austin et al., 2021b). • EvalPlus (HE+ & MBPP+): To ensure robustness against weak unit tests, we utilize the augmented test suites from Liu et al. (2023). For MBPP+, we execute the complete testing pipeline to maximize verification depth. • BigCodeBench: We evaluated on both Full and Hard splits using the offici...

  20. [21]

    Interface & Latent Projection: Diffusion output probabilities over the Dream-Coder vocabulary are projected into the AR embedding space by computing the expected embedding relative to the AR embedding matrix. To ensure a clear termination signal, we apply a discretization step at the sequence boundary: if the diffusion model predicts the EOS token, we rep...

  21. [22]

    For each candidate block, the AR model decodes a block of size B ∈ {2, 4, 8} tokens

    Decoding Scope: At each denoising iteration, the model evaluates a scope of up to 10 masked blocks. For each candidate block, the AR model decodes a block of size B ∈ {2, 4, 8} tokens. As shown in Table 2, increasing the scope results in a drop in accuracy due to premature EOS prediction. Yet, the throughput is maintained within 15%

  22. [23]

    The algorithm greedily unmasks the single block with the lowest average entropy, indicating the highest-confidence prediction

    Lowest-Entropy Unmasking: We calculate the average per-token entropy provided by the AR executor for each candidate block. The algorithm greedily unmasks the single block with the lowest average entropy, indicating the highest-confidence prediction. This yields a static parallelism of B tokens per iteration. Implementation Details. All experiments are conduc...

  23. [24]

    (2020); Sohl-Dickstein et al

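    Decomposition of the factorization gap. Following Ho et al. (2020) and Sohl-Dickstein et al. (2015), the negative ELBO $\mathcal{L}_{\mathrm{NELBO}}$ can be decomposed as follows: $\mathcal{L}_{\mathrm{NELBO}} = \mathbb{E}_{x_0 \sim q}\left[\mathcal{L}^{x_0}_{\mathrm{NELBO}}\right] = \mathbb{E}_{x_{0:T} \sim q}\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] = H[x_0] + \mathbb{E}_{x_{0:T} \sim q}\left[\log \frac{q(x_{0:T})}{p_\theta(x_{0:T})}\right] = H[x_0] + \mathbb{E}_{x_{0:T} \sim q}\left[\log \frac{q(x_T)}{p_\theta(x_T)} + \sum_{t=1}^{T} \log \frac{q(x_{t-1} \mid x_t)}{p_\theta(x_{t-1} \mid x_t)}\right]$ = H[x0] + DKL(q(xT) ∥ pθ(xT...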

  24. [25]

    Recall that a block $b^i_{t-1}$ is composed of the subsequence of tokens $[x^{(i-1)\cdot B+1}_{t-1},$

    Decrease in NELBO for non-trivial block sizes. We now quantify the reduction in the lower bound $\mathcal{B}_B$ relative to the token-level baseline $\mathcal{B}_1$. Recall that a block $b^i_{t-1}$ is composed of the subsequence of tokens $[x^{(i-1)\cdot B+1}_{t-1}, \dots, x^{i\cdot B}_{t-1}]$. Subtracting the expression for $\mathcal{B}_B$ derived in Part 1 from the expression for $\mathcal{B}_1$ (where block size is 1), the terms...

  25. [26]

    Sufficiency of Soft-Conditioning: If $p^{\mathrm{AR}}_\phi$ is conditioned on the full marginals $\pi$, there exists a parameterization $\phi$ such that $p^{\mathrm{AR}}_\phi(\cdot \mid \pi) = q(\cdot)$

  26. [27]

    Fréchet Class Restriction: Let $\pi_{\text{top-}k}$ be the marginals truncated to the $k$ most likely tokens at each position. Conditioning on $\pi_{\text{top-}k}$ restricts the valid solution space to the constrained Fréchet class $\mathcal{F}(\pi_{\text{top-}k})$, strictly limiting the support of any recoverable distribution to the Cartesian product of the top-$k$ sets

  27. [28]

    There exist joint distributions $q$ where the global mode $b^* = \arg\max_b q(b)$ is strictly excluded from the support of the restricted class

    Exclusion of the Global Mode: This restriction introduces an irreducible bias. There exist joint distributions $q$ where the global mode $b^* = \arg\max_b q(b)$ is strictly excluded from the support of the restricted class. Formally: $\exists\, q$ such that $\forall\, q' \in \mathcal{F}(\pi_{\text{top-}k}),\ q'(b^*) = 0 < q(b^*)$. (6) Thus, high-probability coherent structures can be rendered unrecoverable solel...

  28. [29]

    Define the truncated marginal distribution $\pi^i_{\text{top-}k}(t) \propto \pi(t) \cdot \mathbb{1}_{t \in S^i_k}$. Any distribution $q' \in \mathcal{F}(\pi_{\text{top-}k})$ must satisfy the marginal constraints of $\pi_{\text{top-}k}$

    Fréchet Class Restriction. Let $S^i_k = \{t \in \mathcal{V} \mid \mathrm{rank}(t, \pi^i) \le k\}$ be the set of top-$k$ tokens at position $i$. Define the truncated marginal distribution $\pi^i_{\text{top-}k}(t) \propto \pi(t) \cdot \mathbb{1}_{t \in S^i_k}$. Any distribution $q' \in \mathcal{F}(\pi_{\text{top-}k})$ must satisfy the marginal constraints of $\pi_{\text{top-}k}$. Since the probability mass of tokens outside $S^i_k$ is effectively zeroed out in the input, any valid joi...

  29. [30]

    Let $B = 2$, $k = 1$, and $\mathcal{V} = \{\text{Roger}, \text{Houston}, \text{You}, \text{I}, \text{They}\}$

    Exclusion of the Global Mode. We demonstrate this exclusion via a counter-example. Let $B = 2$, $k = 1$, and $\mathcal{V} = \{\text{Roger}, \text{Houston}, \text{You}, \text{I}, \text{They}\}$. Consider a distribution $q(x_1, x_2)$ with the following probability mass function: • $q(\text{Roger}, \text{Roger}) = 0.45$ (the coherent mode $b^*$). • $q(\text{Houston}, \text{You}) = 0.25$. • $q(\text{Houston}, \text{I}) = 0.25$. • $q(\text{Houston}, \text{They}) = 0.05$. The induced mar...

  30. [31]

    $= \alpha_t \delta_{x^i_t, x^i_0} + (1 - \alpha_t)\, \delta_{x^i_t, \texttt{[MASK]}}$, a direct application of Bayes’ theorem results in the closed-form expression $q(x^i_{t-1} \mid x^i_t, x^i_0)$

  31. [32]

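    $= \begin{cases} 1 & \text{if } x^i_{t-1} = x^i_t = x^i_0, \\ \frac{1-\alpha_{t-1}}{1-\alpha_t} & \text{if } x^i_{t-1} = x^i_t = \texttt{[MASK]}, \\ \frac{\alpha_{t-1}-\alpha_t}{1-\alpha_t} & \text{if } x^i_{t-1} = x^i_0,\ x^i_t = \texttt{[MASK]}, \\ 0 & \text{otherwise.} \end{cases}$ We introduce the abbreviation $D$ and apply position-wise independence of $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$: $D := D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) = \mathbb{E}_{q(x_{t-1} \mid x_t, x_0)}$...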

  32. [33]

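    Hence, we get $D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) = \sum_{i=1}^{L} \delta_{x^i_t = \texttt{[MASK]}}\, D_{\mathrm{KL}}\big(q(x^i_{t-1} \mid x^i_t = \texttt{[MASK]}, x^i_0) \,\|\, p_\theta(x^i_{t-1} \mid x_t)\big)$

    $= p_\theta(x_{t-1} \mid x_t) = \delta_{x^i_{t-1}, x^i_0}$. Hence, we get $D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) = \sum_{i=1}^{L} \delta_{x^i_t = \texttt{[MASK]}}\, D_{\mathrm{KL}}\big(q(x^i_{t-1} \mid x^i_t = \texttt{[MASK]}, x^i_0) \,\|\, p_\theta(x^i_{t-1} \mid x_t)\big)$. Now, note that for $x^i_{t-1} = \texttt{[MASK]}$, the probability $q(x^i_{t-1} \mid x^i_t, x^i_0)$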

  33. [34]

    does not depend on $x^i_0$, resulting in $p_\theta(x^i_{t-1} \mid x_t) = q(x^i_{t-1} \mid x^i_t, x^i_0)\ \forall x^i_0$

  34. [35]

    Therefore, the only non-vanishing additive term in the KL divergence occurs when $x^i_{t-1} = x^i_0$, i.e., $D = \sum_{i=1}^{L} \delta_{x^i_t = \texttt{[MASK]}}\, q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, x^i_0)$

  35. [36]

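    Finally, we compute $p_\theta(x^i_{t-1} = x^i_0 \mid x_t)$ provided that $x^i_t = \texttt{[MASK]}$

    $\log \frac{q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, x^i_0)}{p_\theta(x^i_{t-1} = x^i_0 \mid x_t)}$. Finally, we compute $p_\theta(x^i_{t-1} = x^i_0 \mid x_t)$ provided that $x^i_t = \texttt{[MASK]}$. First, note that $q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, \tilde{x}^i_0)$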

  36. [37]

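    $= \frac{\alpha_{t-1}-\alpha_t}{1-\alpha_t}\, \delta_{x^i_0, \tilde{x}^i_0} + \frac{1-\alpha_{t-1}}{1-\alpha_t}\, \delta_{x^i_0, \texttt{[MASK]}}$. Then, due to the assumption that $p_\theta(x^i_0 = \texttt{[MASK]} \mid x_t) = 0$, it follows that $p_\theta(x^i_{t-1} = x^i_0 \mid x_t) = \mathbb{E}_{p_\theta(\tilde{x}^i_0 \mid x_t)}\big[q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, \tilde{x}^i_0)\big] = p_\theta(x^i_0 \mid x_t)\, q(x^i_{t-1} = x^i_0 \mid x^i_t = \texttt{[MASK]}, x^i_0)$. Plugging in and canceling equal factors then results in the desired expression $D = \sum_{i=1}^{L} -\delta_{x^i_t = \texttt{[MASK]}}$...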
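
The lowest-entropy unmasking rule quoted in the extracted appendix text above (average the per-token entropy of each candidate block, then unmask the block with the lowest average) can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout, function names, and toy probabilities are hypothetical and are not taken from the paper's code.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def average_block_entropy(block):
    """Mean per-token entropy of a block.

    block: list of B per-token probability vectors.
    """
    return sum(entropy(p) for p in block) / len(block)


def pick_block_to_unmask(candidates):
    """Return the index of the candidate block with the lowest average entropy.

    candidates: dict mapping block index -> list of per-token distributions.
    """
    return min(candidates, key=lambda i: average_block_entropy(candidates[i]))


# Toy candidates (hypothetical): block 0 is uncertain, block 3 is confident.
candidates = {
    0: [[0.5, 0.5], [0.9, 0.1]],
    3: [[0.99, 0.01], [0.98, 0.02]],
}

print("unmask block:", pick_block_to_unmask(candidates))
```

Unmasking exactly one block of B tokens per iteration corresponds to the static parallelism mentioned in the quoted text. The entropy-threshold variant in the Figure 5 caption is not sketched here because its details are not available on this page.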