Multi-Block Diffusion Language Models

Yijie Jin, Jiajun Xu, Yuxuan Liu, Chenkai Xu, Yi Tu, Jiajun Li, Dandan Tu, Xiaohui Yan, Kai Yu, Pengfei Liu, Zhijie Deng

arxiv: 2606.29215 · v1 · pith:NHCKQL3Ynew · submitted 2026-06-28 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-30 08:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords: block diffusion, multi-block diffusion, teacher forcing, diffusion language models, parallel decoding, text generation, kv caching


The pith

Post-training block diffusion models with multi-block teacher forcing enables concurrent decoding of multiple blocks while maintaining or improving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to extend single-block diffusion language models to multi-block versions that decode several consecutive blocks at once for greater parallelism. Most existing block diffusion models are trained with teacher forcing, which observes only one noisy block at a time, so their training states do not match the heterogeneous noise patterns across a running-set of blocks during multi-block inference. The paper adapts these models with multi-block teacher forcing, a post-training step that trains on bounded groups of noisy blocks, conditioned on clean prefixes, with randomized noise schedules. On math and code tasks this raises average tokens generated per forward pass from 3.47 to 6.19, with little or no loss in accuracy. An optimized block buffer mechanism makes the parallel decoding practical by preserving prefix-cache reuse and keeping input shapes fixed.

Core claim

Block diffusion language models can be turned into multi-block diffusion language models by post-training with multi-block teacher forcing. Multi-block teacher forcing integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, using randomized noise-schedulers that align training states with the heterogeneous slot-wise noise patterns of multi-block inference.

What carries the argument

Multi-block Teacher Forcing (MultiTF): a post-training procedure that exposes the model to bounded noise-groups with randomized noise-schedulers to match the states encountered when a running-set of blocks is decoded concurrently.
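
To make the idea of a bounded noise-group concrete, here is a minimal sketch of how one training state might be assembled. Everything specific in it is an assumption for illustration: the mask token, the block and group sizes, the uniform draw of per-slot mask ratios, and the choice to sort ratios so earlier slots are cleaner. The summary above does not specify the paper's actual noise-schedulers, so this should not be read as the authors' procedure.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def sample_noise_group(tokens, block_size, group_size, rng):
    """Build one hypothetical MultiTF-style training state.

    Returns (inputs, ratios, start): tokens before `start` stay clean,
    the next `group_size` blocks are masked at independently drawn ratios,
    and nothing beyond the bounded group is included.
    """
    num_blocks = len(tokens) // block_size
    if num_blocks < group_size:
        raise ValueError("sequence too short for the requested noise-group")
    # Choose where the bounded noise-group begins; earlier blocks form the clean prefix.
    first = rng.randrange(0, num_blocks - group_size + 1)
    start = first * block_size
    # Randomized scheduler (assumption): one mask ratio per slot, sorted so that
    # earlier slots are cleaner, mimicking blocks that entered the running-set sooner.
    ratios = sorted(rng.uniform(0.0, 1.0) for _ in range(group_size))
    inputs = list(tokens[:start])
    for slot, ratio in enumerate(ratios):
        lo = start + slot * block_size
        block = tokens[lo:lo + block_size]
        n_mask = round(ratio * block_size)
        masked = set(rng.sample(range(block_size), n_mask))
        inputs.extend(MASK if i in masked else tok for i, tok in enumerate(block))
    return inputs, ratios, start


rng = random.Random(0)
tokens = [f"t{i}" for i in range(32)]
inputs, ratios, start = sample_noise_group(tokens, block_size=4, group_size=3, rng=rng)
```

The clean prefix is passed through untouched and the group is truncated after `group_size` blocks, which is the "bounded" part; the per-slot ratios are what would make the noise pattern heterogeneous across slots.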

If this is right

For MBD-LLaDA2-Mini, average tokens per forward pass (TPF) rises from 3.47 to 6.19 while average accuracy improves from 79.95 percent to 81.03 percent.
When combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02 percent accuracy drop on math and code benchmarks.
The block buffer mechanism enables practical multi-block execution by preserving prefix-cache reuse and static input shapes.
Inter-block parallelism becomes usable without requiring changes to the underlying model architecture.
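
The size of these gains is easy to check from the reported figures. The short calculation below uses only numbers quoted above; comparing the DMax result against the 3.47 baseline is an assumption made here for illustration, since this summary does not state which baseline the DMax variant was measured against.

```python
# Figures quoted in the summary above.
baseline_tpf, mbd_tpf, mbd_dmax_tpf = 3.47, 6.19, 9.34
baseline_acc, mbd_acc = 79.95, 81.03

tpf_ratio = mbd_tpf / baseline_tpf            # relative TPF gain from MultiTF
tpf_ratio_dmax = mbd_dmax_tpf / baseline_tpf  # assumption: same 3.47 baseline
acc_delta = mbd_acc - baseline_acc            # change in accuracy, in points

print(f"{tpf_ratio:.2f}x, {tpf_ratio_dmax:.2f}x, {acc_delta:+.2f} points")
# prints "1.78x, 2.69x, +1.08 points"
```

So the headline gain is closer to 1.8x than to a full doubling, alongside an accuracy change of about one point.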

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the training-inference alignment holds, the same post-training step could be tested on other diffusion-based generation pipelines that currently use single-block teacher forcing.
The measured increase in tokens per forward pass implies corresponding wall-clock speedups on hardware that batches forward passes efficiently.

Load-bearing premise

The randomized noise-schedulers and bounded noise-groups used in training will produce states close enough to the varied noise levels across blocks that appear during actual multi-block inference.

What would settle it

Measure tokens per forward pass and accuracy of an MBD-trained model versus a standard BD model when both are run under multi-block inference on the same math and code benchmarks; absence of the reported TPF gains would falsify the claim.
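
A comparison of this kind needs a fixed definition of the metric. The sketch below assumes TPF is the total number of tokens finalized divided by the number of forward passes, which is the natural reading of the name but is not spelled out in this summary. The per-call logs are hypothetical.

```python
def tokens_per_forward(tokens_committed_per_call):
    """Average number of tokens finalized per model forward pass."""
    if not tokens_committed_per_call:
        raise ValueError("need at least one forward pass")
    return sum(tokens_committed_per_call) / len(tokens_committed_per_call)


# Hypothetical logs: tokens finalized at each forward pass for one prompt.
single_block_log = [3, 4, 3, 4, 2]
multi_block_log = [6, 7, 5]

single_tpf = tokens_per_forward(single_block_log)
multi_tpf = tokens_per_forward(multi_block_log)
```

Running both models on the same prompts with the same decoding thresholds, and logging tokens per call this way, would give the two numbers the test compares.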

Original abstract

Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MultiTF post-training and Block Buffer give measurable TPF gains on block diffusion LMs, but the noise-distribution match between train and inference stays unverified.


The core of this paper is a post-training step called MultiTF that trains BD-LMs on bounded noise-groups with randomized schedulers, plus a Block Buffer that keeps decoding shapes static and reuses prefix caches. This lets them move from single-block to multi-block diffusion inference with higher parallelism.

What the work actually shows is concrete: MBD-LLaDA2-Mini lifts average TPF from 3.47 to 6.19 while accuracy edges up from 79.95% to 81.03% on math and code benchmarks. Adding DMax pushes TPF to 9.34 at a 1.02% accuracy cost. The Block Buffer description explains how they turn extra parallelism into wall-clock gains without changing input shapes.
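
One way to picture a buffer that keeps shapes static while reusing the prefix cache is a fixed-capacity sliding window of in-flight blocks. The sketch below is an illustrative guess at that idea, not the paper's algorithm: the class name, its methods, and the use of a plain list as a stand-in for the prefix KV cache are all hypothetical.

```python
from collections import deque

MASK = None  # hypothetical placeholder for a still-masked position


class BlockBuffer:
    """Fixed-capacity window of in-flight blocks (illustrative only)."""

    def __init__(self, num_slots, block_size):
        if num_slots < 1 or block_size < 1:
            raise ValueError("num_slots and block_size must be positive")
        self.block_size = block_size
        self.prefix = []  # committed tokens; stands in for the reusable prefix KV cache
        self.slots = deque([MASK] * block_size for _ in range(num_slots))

    def model_input(self):
        # Always num_slots * block_size positions, so the input shape never changes.
        return [tok for block in self.slots for tok in block]

    def write(self, slot, position, token):
        self.slots[slot][position] = token

    def advance(self):
        """Commit fully decoded leading blocks and refill with masked blocks."""
        committed = 0
        while all(tok is not MASK for tok in self.slots[0]):
            self.prefix.extend(self.slots.popleft())
            self.slots.append([MASK] * self.block_size)
            committed += 1
        return committed


buf = BlockBuffer(num_slots=2, block_size=2)
buf.write(0, 0, "a")
buf.write(0, 1, "b")
buf.write(1, 0, "c")
committed = buf.advance()
```

After `advance()`, the finished block has moved into `prefix`, a fresh masked block has been appended, and `model_input()` still has the same length, which is the property a compiled or graph-captured forward pass would rely on.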

The soft spot is the central assumption. MultiTF is meant to close the gap between teacher-forcing training states and the heterogeneous per-slot noise that appears when a running-set of blocks decodes together. The abstract claims the randomized schedulers achieve this, but it gives no direct check—moments, KL divergence, or even a side-by-side plot—of the training noise distribution versus the actual inference noise pattern. Without that, the TPF and accuracy lifts could come from generic extra training rather than from fixing the train-inference mismatch. The reported numbers also lack variance, exact data splits, or statistical tests, which makes the empirical claim plausible but not yet tight.
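
The missing check could be as simple as the sketch below, which compares per-slot mask ratios seen in training against those logged during inference. The sample values are hypothetical placeholders, and Jensen-Shannon divergence is swapped in for plain KL here because it stays finite when a histogram bin is empty on one side.

```python
import math
import statistics


def histogram(values, bins):
    """Normalized histogram of values assumed to lie in [0, 1]."""
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical per-slot mask ratios; real ones would come from training batches
# and from logged MultiBD decoding trajectories.
train_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 0.2, 0.6, 0.8]
infer_ratios = [0.05, 0.15, 0.4, 0.95, 0.9, 0.85, 0.1, 0.2]

moments = {
    "train": (statistics.fmean(train_ratios), statistics.pvariance(train_ratios)),
    "infer": (statistics.fmean(infer_ratios), statistics.pvariance(infer_ratios)),
}
gap = js_divergence(histogram(train_ratios, 5), histogram(infer_ratios, 5))
```

A divergence near zero would support the alignment claim; a value near `math.log(2)`, the maximum for this measure, would indicate the two distributions barely overlap.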

This paper is for people already working on diffusion-based language models who need faster generation. The techniques are specific and the efficiency numbers are directly measurable, so the work is worth a serious referee even if the noise-matching evidence needs strengthening in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Multi-Block Diffusion Language Models (MBD-LMs) by post-training existing Block Diffusion LMs with Multi-block Teacher Forcing (MultiTF), which trains on bounded noise-groups conditioned on clean prefixes using randomized noise-schedulers to better align with MultiBD inference on a running-set of blocks with heterogeneous per-slot noise. It further introduces a Block Buffer mechanism for optimized decoding that preserves KV-cache reuse and static input shapes. On math and code benchmarks, MBD-LLaDA2-Mini reports average TPF rising from 3.47 to 6.19 with accuracy improving from 79.95% to 81.03%; combining with DMax yields TPF of 9.34 at a 1.02% accuracy cost.

Significance. If the reported TPF gains prove robust and the MultiTF procedure demonstrably closes the train-inference gap, the work offers a practical route to higher parallelism in diffusion-based language models without substantial quality loss. The Block Buffer decoding algorithm provides a concrete engineering solution for maintaining efficiency under concurrent block decoding. The concrete numerical improvements on standard benchmarks constitute a tangible contribution to efficient generative modeling, though the lack of direct distributional validation weakens the causal attribution to gap closure.

major comments (2)

[Abstract and §3] MultiTF description: The central claim that randomized noise-schedulers and bounded noise-groups produce training states whose joint noise statistics match the heterogeneous slot-wise noise patterns of MultiBD inference with a running-set is load-bearing for attributing the TPF/accuracy gains to gap closure rather than generic post-training; however, no quantitative comparison (moments, histograms, or divergence metrics) between the training noise-group distribution and the inference noise pattern conditioned on the evolving running-set is reported.
[§4 (Experiments)] Experiments and abstract results: The average TPF (6.19, 9.34) and accuracy (81.03%, 1.02% drop) figures are presented without standard deviations across runs, number of evaluation seeds, exact data splits, baseline implementation details, or statistical significance tests; this absence makes it impossible to assess whether the reported lifts exceed noise and undermines the reliability of the cross-configuration claims.
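
Even with a single run per configuration, the uncertainty that comes from the finite benchmark can be estimated. The sketch below is a paired bootstrap over benchmark items, assuming per-item correctness (1 or 0) is available for both models on the same items; the function name and the example vectors are hypothetical. It does not capture variance across training or decoding seeds, which would still need repeated runs.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for accuracy(B) - accuracy(A) on the same benchmark items."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long correctness vectors")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-item results for a 100-item benchmark.
baseline_correct = [1] * 80 + [0] * 20
mbd_correct = [1] * 81 + [0] * 19
ci_low, ci_high = paired_bootstrap_ci(baseline_correct, mbd_correct)
```

If the resulting interval includes zero, an accuracy lift of about one point cannot be distinguished from benchmark sampling noise at that benchmark size.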

minor comments (1)

[Abstract] The term 'DMax' appears without definition or citation, requiring readers to infer its meaning from later sections.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.


Referee: [Abstract and §3] MultiTF description: The central claim that randomized noise-schedulers and bounded noise-groups produce training states whose joint noise statistics match the heterogeneous slot-wise noise patterns of MultiBD inference with a running-set is load-bearing for attributing the TPF/accuracy gains to gap closure rather than generic post-training; however, no quantitative comparison (moments, histograms, or divergence metrics) between the training noise-group distribution and the inference noise pattern conditioned on the evolving running-set is reported.

Authors: We agree that a quantitative validation of the noise distribution alignment would strengthen the causal link between MultiTF and the observed gains. In the revision we will add to §3 a direct comparison including first- and second-order moments of the per-slot noise levels, as well as histograms and a simple divergence measure, computed both on the bounded noise-groups used during MultiTF training and on simulated running-sets drawn from MultiBD inference trajectories. This addition will be placed immediately after the MultiTF description.

revision: yes

Referee: [§4 (Experiments)] Experiments and abstract results: The average TPF (6.19, 9.34) and accuracy (81.03%, 1.02% drop) figures are presented without standard deviations across runs, number of evaluation seeds, exact data splits, baseline implementation details, or statistical significance tests; this absence makes it impossible to assess whether the reported lifts exceed noise and undermines the reliability of the cross-configuration claims.

Authors: We will expand §4 with the exact evaluation data splits, full baseline implementation details (including the precise BD-LM checkpoint and decoding settings used for the 3.47 TPF reference), and a statement that all numbers are from single runs. We will also add a short discussion of consistency across the math and code benchmarks. Because the original experiments were performed with a single seed per configuration, we cannot retroactively supply standard deviations or significance tests without new compute; this limitation will be explicitly noted.

revision: partial

standing simulated objections not resolved

Supplying standard deviations across multiple evaluation seeds and performing statistical significance tests, because the reported results derive from single-run experiments.

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on held-out benchmark measurements


The paper presents MBD-LMs via post-training with MultiTF (bounded noise-groups and randomized noise-schedulers) and reports TPF/accuracy numbers obtained by direct evaluation on math and code benchmarks. These quantities are measured outcomes, not quantities obtained by fitting parameters inside the same equations or by renaming inputs as predictions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work appear in the derivation chain. The central assumption about noise distribution match is an empirical design choice whose validity is tested by the external benchmark results rather than being true by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are introduced; the work relies on standard diffusion training assumptions and existing BD-LM architectures without new postulated objects.



