DMax: Aggressive Parallel Decoding for dLLMs

Gongfan Fang; Ruonan Yu; Xinchao Wang; Xinyin Ma; Zigeng Chen

arxiv: 2604.08302 · v3 · pith:KFKTDL5Hnew · submitted 2026-04-09 · 💻 cs.LG · cs.AI

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang

Pith reviewed 2026-05-19 16:42 UTC · model grok-4.3

keywords: diffusion language models · parallel decoding · embedding interpolation · self-refinement · error accumulation · tokens per forward pass · On-Policy Uniform Training

The pith

DMax reformulates parallel decoding for diffusion language models as progressive self-refinement from mask to token embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DMax to allow more aggressive parallel decoding in diffusion language models without the usual buildup of errors. It introduces On-Policy Uniform Training so the model learns to correct both masked inputs and its own past mistakes. Soft Parallel Decoding then treats each step as an interpolation in embedding space between the mask and the model's current best guess, letting the model revise predictions iteratively. Experiments show this raises tokens per forward pass on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while accuracy stays comparable. The approach matters because it turns diffusion language models into faster generators for math and code tasks.

Core claim

DMax mitigates error accumulation in parallel decoding for dLLMs by reformulating the process as a progressive self-refinement from mask embeddings to token embeddings. On-Policy Uniform Training unifies masked and uniform dLLMs so the model recovers clean tokens from both masked inputs and its own erroneous predictions. Soft Parallel Decoding represents each intermediate state as an interpolation between the predicted token embedding and the mask embedding to enable iterative self-revising.

What carries the argument

Soft Parallel Decoding: each intermediate state is an interpolation between the predicted token embedding and the mask embedding, which lets the model revise its predictions iteratively in embedding space.
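The interpolation idea can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function name `interpolate_state`, the scalar mixing weight `alpha`, and the toy three-dimensional embeddings are all assumptions made for illustration, and the source does not specify how the weight is chosen in practice.

```python
from typing import List

Vector = List[float]


def interpolate_state(token_emb: Vector, mask_emb: Vector, alpha: float) -> Vector:
    """Blend a predicted token embedding with the mask embedding.

    alpha = 0.0 returns the mask embedding and alpha = 1.0 returns the token
    embedding. How alpha is chosen is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(token_emb) != len(mask_emb):
        raise ValueError("embeddings must have the same dimension")
    return [alpha * t + (1.0 - alpha) * m for t, m in zip(token_emb, mask_emb)]


# Toy 3-dimensional embeddings (illustrative values only).
mask_emb = [0.0, 0.0, 1.0]
token_emb = [1.0, 2.0, 0.0]

# Walk from the mask embedding toward the token embedding in four equal steps.
states = [interpolate_state(token_emb, mask_emb, step / 4) for step in range(5)]
```

The first entry of `states` equals the mask embedding and the last equals the token embedding; the entries in between are the "soft" states that a hard mask-to-token flip never visits.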

Load-bearing premise

Representing each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding enables effective iterative self-revising without accumulating new errors.

What would settle it

Running DMax on GSM8K and observing that tokens per forward pass stays near the baseline's 2.04, or that accuracy falls below the LLaDA-2.0-mini baseline.
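To make the metric in this test concrete, here is a small sketch that assumes TPF is simply the number of generated tokens divided by the number of model forward passes. The helper names `tokens_per_forward_pass` and `relative_gain` are hypothetical; the four TPF figures are the ones quoted in the abstract.

```python
def tokens_per_forward_pass(tokens_generated: int, forward_passes: int) -> float:
    """Assumed definition of TPF: generated tokens per model forward pass."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_generated / forward_passes


def relative_gain(new_tpf: float, baseline_tpf: float) -> float:
    """Ratio of the new TPF to the baseline TPF."""
    return new_tpf / baseline_tpf


# TPF figures quoted in the abstract (LLaDA-2.0-mini baseline vs. DMax).
gsm8k_gain = relative_gain(5.47, 2.04)
mbpp_gain = relative_gain(5.86, 2.71)
print(f"GSM8K: {gsm8k_gain:.2f}x, MBPP: {mbpp_gain:.2f}x")  # GSM8K: 2.68x, MBPP: 2.16x
```

Under that definition, the reported numbers correspond to roughly 2.7 times as many tokens per forward pass on GSM8K and about 2.2 times as many on MBPP.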

Figures

Figures reproduced from arXiv: 2604.08302 by Gongfan Fang, Ruonan Yu, Xinchao Wang, Xinyin Ma, Zigeng Chen.

**Figure 2.** Overview of the proposed On-Policy Uniform Training.

**Figure 3.** Overview of the Soft Parallel Decoding procedure in DMax.

**Figure 4.** Comparison of accuracy-TPF trade-off curves between original LLaDA-2.0-mini model

Original abstract

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DMax delivers concrete TPF gains on math and code tasks for diffusion LLMs by pairing uniform-noise training with embedding interpolation, but the gains hinge on an untested assumption that interpolated states stay inside the model's denoising range.

DMax gives a practical way to push parallelism higher in diffusion LLMs. The core moves are On-Policy Uniform Training, which mixes masked and uniform inputs so the model learns to fix its own mistakes, and Soft Parallel Decoding, which steps through linear interpolations between mask and predicted token embeddings instead of hard flips. The reported numbers show TPF rising from 2.04 to 5.47 on GSM8K and from 2.71 to 5.86 on MBPP with accuracy roughly preserved, plus a GitHub link for the code. That combination is the main thing worth noting here.

Referee Report

2 major / 2 minor

Summary. The paper introduces DMax for diffusion language models (dLLMs), reformulating parallel decoding as progressive self-refinement via interpolation between predicted token embeddings and mask embeddings. It proposes On-Policy Uniform Training to unify masked and uniform dLLM training for recovery from erroneous predictions, and Soft Parallel Decoding to enable iterative self-revision in embedding space. The central empirical claim is that this yields substantial gains in tokens per forward pass (TPF) while preserving accuracy: on GSM8K, TPF rises from 2.04 to 5.47 versus LLaDA-2.0-mini; on MBPP, from 2.71 to 5.86. High throughput (average 1,338 TPS at batch size 1 on two H200 GPUs) is also reported, with code released at https://github.com/czg1225/DMax.

Significance. If the embedding-space interpolation and on-policy training reliably prevent error accumulation under aggressive parallelism, the work would offer a practical route to higher-throughput dLLM inference without quality degradation, addressing a key limitation in current masked diffusion models. The GitHub code release supports reproducibility and is a clear strength.

major comments (2)

[Soft Parallel Decoding (method description)] The central performance claims rest on Soft Parallel Decoding's linear interpolation between predicted token and mask embeddings producing states that the model can reliably denoise across parallel steps. No ablation is presented that removes the interpolation (or varies the schedule) while keeping On-Policy Uniform Training fixed, leaving open whether the reported TPF gains (GSM8K: 2.04→5.47; MBPP: 2.71→5.86) survive without this component or when interpolated states fall outside the training manifold.
[Experiments and results] The experimental results section reports point estimates for TPF and accuracy but supplies neither error bars across multiple runs nor a detailed specification of the interpolation schedule or exact on-policy sampling procedure. These omissions make it impossible to assess whether the gains are robust or sensitive to the precise training/decoding controls that the abstract claims are essential.

minor comments (2)

The abstract states 'extensive experiments across a variety of benchmarks' yet only GSM8K and MBPP numbers are highlighted; a summary table of additional tasks would strengthen the generality claim.
[Method] Notation for the interpolation parameter (e.g., the mixing coefficient between token and mask embeddings) should be introduced with an explicit equation in the method section for clarity.
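One possible form for the explicit equation requested in the last minor comment is sketched below. The symbols are assumptions rather than the paper's notation, and the direction of the weighting (the coefficient multiplying the token embedding) is also an assumption; the linear schedule is the one quoted in the simulated rebuttal.

```latex
% Assumed notation, not taken from the paper:
%   \hat{x}_i^{(t)} : token predicted at position i after step t
%   E(\cdot)        : token embedding lookup
%   e_{\text{mask}} : embedding of the mask token
%   \alpha_t        : mixing coefficient in [0, 1]
\[
  h_i^{(t)} = \alpha_t \, E\!\left(\hat{x}_i^{(t)}\right) + (1 - \alpha_t)\, e_{\text{mask}},
  \qquad \alpha_t = \frac{t}{T}, \quad t = 0, 1, \dots, T .
\]
```

With this convention the state is the pure mask embedding at t = 0 and the pure token embedding at t = T, which matches the description of decoding as a progression from mask embeddings to token embeddings.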

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our method and results.

Referee: [Soft Parallel Decoding (method description)] The central performance claims rest on Soft Parallel Decoding's linear interpolation between predicted token and mask embeddings producing states that the model can reliably denoise across parallel steps. No ablation is presented that removes the interpolation (or varies the schedule) while keeping On-Policy Uniform Training fixed, leaving open whether the reported TPF gains (GSM8K: 2.04→5.47; MBPP: 2.71→5.86) survive without this component or when interpolated states fall outside the training manifold.

Authors: We agree that an explicit ablation isolating the interpolation component of Soft Parallel Decoding would strengthen the claims. In the revised manuscript we add this ablation: we retain On-Policy Uniform Training but replace the linear embedding interpolation with conventional hard (binary mask-to-token) parallel decoding. The results show that TPF gains are substantially smaller and accuracy degrades at the reported parallelism levels, confirming that the embedding-space refinement is necessary to avoid error accumulation. We also document the exact linear schedule (alpha_t = t / T for decoding horizon T) and note that on-policy training, which repeatedly exposes the model to its own intermediate predictions, keeps interpolated states inside the training distribution.
revision: yes
Referee: [Experiments and results] The experimental results section reports point estimates for TPF and accuracy but supplies neither error bars across multiple runs nor a detailed specification of the interpolation schedule or exact on-policy sampling procedure. These omissions make it impossible to assess whether the gains are robust or sensitive to the precise training/decoding controls that the abstract claims are essential.

Authors: We accept this criticism. The revised manuscript now reports means and standard deviations over five independent runs with different random seeds for all TPF and accuracy figures on GSM8K and MBPP. We have also added a precise description of the interpolation schedule and the on-policy sampling procedure in Section 3 and the appendix: at each training step we sample the model's current prediction, form the interpolated embedding, and train the model to recover the clean token from that state. These additions allow readers to evaluate robustness directly.
revision: yes
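The kind of multi-seed reporting described in the second response could be computed along these lines. This is a minimal sketch using Python's standard `statistics` module; the helper name `summarize_runs` is hypothetical and the input values are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(values), stdev(values)


# Placeholder values standing in for five seeds; NOT results from the paper.
placeholder_runs = [1.0, 2.0, 3.0, 4.0, 5.0]
avg, sd = summarize_runs(placeholder_runs)
print(f"{avg:.2f} ± {sd:.2f}")
```

`stdev` is the sample standard deviation (dividing by n - 1), which is the usual choice when a handful of seeds is treated as a sample from all possible runs.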

Circularity Check

0 steps flagged

No significant circularity; new procedures are independent additions to existing dLLM backbones.

The paper introduces On-Policy Uniform Training and Soft Parallel Decoding as novel strategies that unify training regimes and reformulate decoding as embedding-space interpolation. Reported TPF gains (e.g., GSM8K 2.04 to 5.47) are presented as empirical outcomes from experiments on LLaDA-2.0-mini, not as quantities derived by construction from fitted parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes reduce the central claims to inputs by definition. The method is self-contained with independent content against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard diffusion-model assumptions about gradual denoising and embedding continuity; no new free parameters or invented entities are introduced in the abstract summary.

axioms (1)

Domain assumption: Diffusion language models can be trained to recover clean tokens from noisy or masked inputs.
Invoked when describing On-Policy Uniform Training, which equips the model to recover from both masks and its own errors.

pith-pipeline@v0.9.0 · 5778 in / 1263 out tokens · 40424 ms · 2026-05-19T16:42:17.239764+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Paper passage: "On-Policy Uniform Training... samples noisy inputs on-policy from the model’s own predictive distribution."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · 25 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

OpenCodeInstruct: A large-scale instruction tuning dataset for code LLMs.arXiv preprint, arXiv:2504.04030, 2025

Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. Opencodeinstruct: A large-scale instruction tuning dataset for code llms. arXiv preprint arXiv:2504.04030, 2025

work page arXiv 2025
[3]

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models.arXiv preprint arXiv:2503.09573, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

work page 2021
[5]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Learning to parallel: Accelerating diffusion large language models via adaptive parallel decoding

Wenrui Bao, Zhiben Chen, Dan Xu, and Yuzhang Shang. Learning to parallel: Accelerating diffusion large language models via adaptive parallel decoding. InThe Fourteenth International Conference on Learning Representations, 2025

work page 2025
[8]

Accelerated sampling from masked diffusion models via entropy bounded unmasking.arXiv preprint arXiv:2505.24857, 2025

Heli Ben-Hamu, Itai Gat, Daniel Severo, Niklas Nolte, and Brian Karrer. Accelerated sampling from masked diffusion models via entropy bounded unmasking.arXiv preprint arXiv:2505.24857, 2025

work page arXiv 2025
[9]

Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, et al. Llada2. 1: Speeding up text diffusion via token editing.arXiv preprint arXiv:2602.08676, 2026

work page arXiv 2026
[10]

Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads.arXiv preprint arXiv:2401.10774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

arXiv preprint arXiv:2602.06036 , year=

Jian Chen, Yesheng Liang, and Zhijian Liu. Dflash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

work page internal anchor Pith review arXiv 2026
[13]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page 2021
[14]

dultra: Ultra-fast diffusion language models via reinforcement learning.arXiv preprint arXiv:2512.21446, 2025

Shirui Chen, Jiantao Jiao, Lillian J Ratliff, and Banghua Zhu. dultra: Ultra-fast diffusion language models via reinforcement learning.arXiv preprint arXiv:2512.21446, 2025

work page arXiv 2025
[15]

Dpad: Efficient diffusion language models with suffix dropout.arXiv preprint arXiv:2508.14148,

Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai Li, Yiran Chen, et al. Dpad: Efficient diffusion language models with suffix dropout.arXiv preprint arXiv:2508.14148, 2025

work page arXiv 2025
[16]

dparallel: Learnable parallel decoding for dllms.arXiv preprint arXiv:2509.26488, 2025

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, and Xinchao Wang. dparallel: Learnable parallel decoding for dllms.arXiv preprint arXiv:2509.26488, 2025. 11

work page arXiv 2025
[17]

Sdar: A syn- ergistic diffusion-autoregression paradigm for scalable sequence generation.arXiv preprint arXiv:2510.06303,

Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, et al. Sdar: A synergistic diffusion-autoregression paradigm for scalable sequence generation.arXiv preprint arXiv:2510.06303, 2025

work page arXiv 2025
[18]

Sdar-vl: Stable and efficient block-wise diffusion for vision-language understanding.arXiv preprint arXiv:2512.14068, 2025

Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, and Bowen Zhou. Sdar-vl: Stable and efficient block-wise diffusion for vision-language understanding.arXiv preprint arXiv:2512.14068, 2025

work page arXiv 2025
[19]

Moe-diffuseq: Enhancing long-document diffusion models with sparse attention and mixture of experts.arXiv preprint arXiv:2512.20604, 2025

Alexandros Christoforos and Chadbourne Davis. Moe-diffuseq: Enhancing long-document diffusion models with sparse attention and mixture of experts.arXiv preprint arXiv:2512.20604, 2025

work page arXiv 2025
[20]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Stable-diffcoder: Pushing the frontier of code diffusion large language model.arXiv preprint arXiv:2601.15892, 2026

Chenghao Fan, Wen Heng, Bo Li, Sichen Liu, Yuxuan Song, Jing Su, Xiaoye Qu, Kai Shen, and Wei Wei. Stable-diffcoder: Pushing the frontier of code diffusion large language model.arXiv preprint arXiv:2601.15892, 2026

work page arXiv 2026
[22]

dvoting: Fast voting for dllms.arXiv preprint arXiv:2602.12153, 2026

Sicheng Feng, Zigeng Chen, Xinyin Ma, Gongfan Fang, and Xinchao Wang. dvoting: Fast voting for dllms.arXiv preprint arXiv:2602.12153, 2026

work page arXiv 2026
[23]

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, et al. Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed.arXiv preprint arXiv:2512.14067, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Diffu- coder: Understanding and improving masked diffusion mod- els for code generation.arXiv preprint arXiv:2506.20639,

Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

work page arXiv 2025
[25]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models.arXiv preprint arXiv:2506.04178, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Reward-weighted sampling: Enhancing non-autoregressive characteristics in masked diffusion llms.arXiv preprint arXiv:2509.00707, 2025

Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, and Jaegul Choo. Reward-weighted sampling: Enhancing non-autoregressive characteristics in masked diffusion llms.arXiv preprint arXiv:2509.00707, 2025

work page arXiv 2025
[28]

Ultrallada: Scaling the context length to 128k for diffusion large language models.arXiv preprint arXiv:2510.10481, 2025

Guangxin He, Shen Nie, Fengqi Zhu, Yuankang Zhao, Tianyi Bai, Ran Yan, Jie Fu, Chongxuan Li, and Binhang Yuan. Ultrallada: Scaling the context length to 128k for diffusion large language models.arXiv preprint arXiv:2510.10481, 2025

work page arXiv 2025
[29]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[30]

Soft-masked diffusion language models, 2025

Michael Hersche, Samuel Moor-Smith, Thomas Hofmann, and Abbas Rahimi. Soft-masked diffusion language models.arXiv preprint arXiv:2510.17206, 2025

work page arXiv 2025
[31]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[32]

Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, and Jiangchao Yao. Wide-in, narrow-out: Revokable decoding for efficient and effective dllms.arXiv preprint arXiv:2507.18578, 2025

work page arXiv 2025
[33]

Lightningrl: Breaking the accuracy- parallelism trade-off of block-wise dllms via reinforcement learning.arXiv preprint arXiv:2603.13319, 2026

Yanzhe Hu, Yijie Jin, Pengfei Liu, Kai Yu, and Zhijie Deng. Lightningrl: Breaking the accuracy- parallelism trade-off of block-wise dllms via reinforcement learning.arXiv preprint arXiv:2603.13319, 2026

work page arXiv 2026
[34]

Residual Context Diffusion Language Models

Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W Mahoney, Sewon Min, Mehrdad Farajtabar, et al. Residual context diffusion language models.arXiv preprint arXiv:2601.22954, 2026. 12

work page arXiv 2026
[35]

S., Seo, J.-s., Zhang, Z., and Gupta, U

Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S Abdelfattah, Jae-sun Seo, Zhiru Zhang, and Udit Gupta. Accelerating diffusion language model inference via efficient kv caching and guided diffusion. arXiv preprint arXiv:2505.21467, 2025

work page arXiv 2025
[36]

Mask tokens as prophet: Fine-grained cache eviction for efficient dllm inference.arXiv preprint arXiv:2510.09309, 2025

Jianuo Huang, Yaojie Zhang, Yicun Yang, Benhao Huang, Biqing Qi, Dongrui Liu, and Linfeng Zhang. Mask tokens as prophet: Fine-grained cache eviction for efficient dllm inference.arXiv preprint arXiv:2510.09309, 2025

work page arXiv 2025
[37]

Accelerating diffusion llms via adaptive parallel decoding.arXiv preprint arXiv:2506.00413, 2025

Daniel Israel, Guy Van den Broeck, and Aditya Grover. Accelerating diffusion llms via adaptive parallel decoding.arXiv preprint arXiv:2506.00413, 2025

work page arXiv 2025
[38]

Cdlm: Consistency diffusion language models for faster sampling.arXiv preprint arXiv:2511.19269, 2025

Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, and Amir Gholami. Cdlm: Consistency diffusion language models for faster sampling.arXiv preprint arXiv:2511.19269, 2025

work page arXiv 2025
[39]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023
[40]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

work page 2024
[41]

Beyond fixed: Variable-length denoising for diffusion large language models.arXiv e-prints, pages arXiv–2508, 2025

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Beyond fixed: Variable-length denoising for diffusion large language models.arXiv e-prints, pages arXiv–2508, 2025

work page 2025
[42]

Lavida: A large diffusion model for vision-language understanding.Advances in neural information process- ing systems, 2025b

Tianyi Li, Mingda Chen, Bowei Guo, and Zhiqiang Shen. A survey on diffusion language models.arXiv preprint arXiv:2508.10875, 2025

work page arXiv 2025
[43]

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[45]

Wedlm: Reconciling diffusion language models with standard causal atten- tion for fast inference.arXiv preprint arXiv:2512.22737,

Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, and Jie Zhou. Wedlm: Reconciling diffusion language models with standard causal attention for fast inference.arXiv preprint arXiv:2512.22737, 2025

work page arXiv 2025
[46]

Tidar: Think in diffusion, talk in autoregression.arXiv preprint arXiv:2511.08923, 2025

Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, and Pavlo Molchanov. Tidar: Think in diffusion, talk in autoregression.arXiv preprint arXiv:2511.08923, 2025

work page arXiv 2025
[47]

Longllada: Unlocking long context capabilities in diffusion llms

Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, and Xipeng Qiu. Longllada: Unlocking long context capabilities in diffusion llms. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 32186–32194, 2026

work page 2026
[48]

Mmada-vla: Large diffusion vision-language-action model with unified multi-modal instruction and generation.arXiv preprint arXiv:2603.25406, 2026

Yang Liu, Pengxiang Ding, Tengyue Jiang, Xudong Wang, Wenxuan Song, Minghui Lin, Han Zhao, Hongyin Zhang, Zifeng Zhuang, Wei Zhao, et al. Mmada-vla: Large diffusion vision-language-action model with unified multi-modal instruction and generation.arXiv preprint arXiv:2603.25406, 2026

work page arXiv 2026
[49]

Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y ., Singh, V ., Kautz, J., Zhang, C., and Molchanov, P

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. dllm-cache: Accelerating diffusion large language models with adaptive caching.arXiv preprint arXiv:2506.06295, 2025

work page arXiv 2025
[50]

Focus- dllm: Accelerating long-context diffusion llm inference via confidence-guided context focusing.arXiv preprint arXiv:2602.02159, 2026

Lingkun Long, Yushi Huang, Shihao Bai, Ruihao Gong, Jun Zhang, Ao Zhou, and Jianlei Yang. Focus- dllm: Accelerating long-context diffusion llm inference via confidence-guided context focusing.arXiv preprint arXiv:2602.02159, 2026

work page arXiv 2026
[51]

Discrete diffusion language modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution. 2023

work page 2023
[52]

Diffusion in diffusion: Breaking the autoregressive bottleneck in block diffusion models.arXiv preprint arXiv:2601.13599, 2026

Linrui Ma, Yufei Cui, Kai Han, and Yunhe Wang. Diffusion in diffusion: Breaking the autoregressive bottleneck in block diffusion models.arXiv preprint arXiv:2601.13599, 2026

work page arXiv 2026
[53]

dkv-cache: The cache for diffusion language models.arXiv preprint arXiv:2505.15781, 2025

Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models.arXiv preprint arXiv:2505.15781, 2025. 13

work page arXiv 2025
[54]

dinfer: An efficient inference framework for diffusion language models, 2025

Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, et al. dinfer: An efficient inference framework for diffusion language models.arXiv preprint arXiv:2510.08666, 2025

work page arXiv 2025
[55]

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers, 2021.
[56]

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276, 2025.
[57]

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, et al. The flexibility trap: Why arbitrary order limits reasoning potential in diffusion language models. arXiv preprint arXiv:2601.15165, 2026.
[58]

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.
[59]

Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang, Minghua He, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, et al. d-treerpo: Towards more reliable policy optimization for diffusion language models. arXiv preprint arXiv:2512.09675, 2025.
[60]

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[61]

Xiaojing Qi, Lun Du, Xinyuan Zhang, Lanning Wei, Tao Jin, and Da Zheng. Hierarchy decoding: A training-free parallel decoding strategy for diffusion large language models. In The Fourteenth International Conference on Learning Representations.
[62]

Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, and Hao Zhang. d3llm: Ultra-fast diffusion llm using pseudo-trajectory distillation. arXiv preprint arXiv:2601.07568, 2026.
[63]

Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, and Wei Deng. Improving reasoning for diffusion language models via group diffusion policy optimization. arXiv preprint arXiv:2510.08554, 2025.
[64]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[65]

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023.
[66]

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
[67]

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality. arXiv preprint arXiv:2506.10892, 2025.
[68]

Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Alexander Rush, Volodymyr Kuleshov, Hugo Dalla-Torre, Sam Boshar, Bernardo P de Almeida, and Thomas Pierrot. Simple guidance mechanisms for discrete diffusion models. In... International Conference on Learning Representations, volume 2025, page 44153, 2025.
[69]

Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. Scaling beyond masked diffusion language models. arXiv e-prints, pages arXiv–2602, 2026.
[70]

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in neural information processing systems, 37:103131–103167, 2024.
[71]

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[72]

Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, and Xipeng Qiu. Sparse-dllm: Accelerating diffusion llms with dynamic cache eviction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33038–33046, 2026.
[73]

Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193, 2025.
[74]

Xiaohang Tang, Rares Dolga, Sangwoong Yoon, and Ilija Bogunovic. wd1: Weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838, 2025.
[75]

Yuchuan Tian, Yuchen Liang, Shuo Zhang, Yingte Shu, Guangwen Yang, Wei He, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, et al. From next-token to next-block: A principled adaptation path for diffusion llms. arXiv preprint arXiv:2512.06776, 2025.
[76]

Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion. arXiv preprint arXiv:2503.04482, 2025.
[77]

Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, and Weiyao Lin. Creditdecoding: Accelerating parallel decoding in diffusion large language models with trace credits. arXiv preprint arXiv:2510.06133, 2025.
[78]

Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, and Zhijie Deng. Diffusion llms can do faster-than-ar inference via discrete diffusion forcing. arXiv preprint arXiv:2508.09192, 2025.
[79]

Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, and Xinchao Wang. Sparsed: Sparse attention for diffusion language models. In The Fourteenth International Conference on Learning Representations, 2026.
[80]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

Showing first 80 references.

[49] [49]

Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y ., Singh, V ., Kautz, J., Zhang, C., and Molchanov, P

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. dllm-cache: Accelerating diffusion large language models with adaptive caching.arXiv preprint arXiv:2506.06295, 2025

work page arXiv 2025

[50] [50]

Focus- dllm: Accelerating long-context diffusion llm inference via confidence-guided context focusing.arXiv preprint arXiv:2602.02159, 2026

Lingkun Long, Yushi Huang, Shihao Bai, Ruihao Gong, Jun Zhang, Ao Zhou, and Jianlei Yang. Focus- dllm: Accelerating long-context diffusion llm inference via confidence-guided context focusing.arXiv preprint arXiv:2602.02159, 2026

work page arXiv 2026

[51] [51]

Discrete diffusion language modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution. 2023

work page 2023

[52] [52]

Diffusion in diffusion: Breaking the autoregressive bottleneck in block diffusion models.arXiv preprint arXiv:2601.13599, 2026

Linrui Ma, Yufei Cui, Kai Han, and Yunhe Wang. Diffusion in diffusion: Breaking the autoregressive bottleneck in block diffusion models.arXiv preprint arXiv:2601.13599, 2026

work page arXiv 2026

[53] [53]

dkv-cache: The cache for diffusion language models.arXiv preprint arXiv:2505.15781, 2025

Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models.arXiv preprint arXiv:2505.15781, 2025. 13

work page arXiv 2025

[54] [54]

dinfer: An efficient inference framework for diffusion language models, 2025

Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, et al. dinfer: An efficient inference framework for diffusion language models.arXiv preprint arXiv:2510.08666, 2025

work page arXiv 2025

[55] [55]

A diverse corpus for evaluating and developing english math word problem solvers, 2021

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers, 2021

work page 2021

[56] [56]

Diffusion language models are super data learners.arXiv preprint arXiv:2511.03276, 2025

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners.arXiv preprint arXiv:2511.03276, 2025

work page arXiv 2025

[57] [57]

The flexibility trap: Why arbitrary order limits reasoning potential in diffusion language models.arXiv preprint arXiv:2601.15165, 2026

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, et al. The flexibility trap: Why arbitrary order limits reasoning potential in diffusion language models.arXiv preprint arXiv:2601.15165, 2026

work page arXiv 2026

[58] [58]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang, Minghua He, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, et al. d-treerpo: Towards more reliable policy optimization for diffusion language models.arXiv preprint arXiv:2512.09675, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [60]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[61] [61]

Hierarchy decoding: A training-free parallel decoding strategy for diffusion large language models

Xiaojing Qi, Lun Du, Xinyuan Zhang, Lanning Wei, Tao Jin, and Da Zheng. Hierarchy decoding: A training-free parallel decoding strategy for diffusion large language models. InThe Fourteenth International Conference on Learning Representations

work page

[62] [62]

d3llm: Ultra-fast diffusion llm using pseudo- trajectory distillation.arXiv preprint arXiv:2601.07568,

Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, and Hao Zhang. d3llm: Ultra-fast diffusion llm using pseudo-trajectory distillation.arXiv preprint arXiv:2601.07568, 2026

work page arXiv 2026

[63] [63]

Improving reasoning for diffusion language models via group diffusion policy optimization

Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, and Wei Deng. Improving reasoning for diffusion language models via group diffusion policy optimization.arXiv preprint arXiv:2510.08554, 2025

work page arXiv 2025

[64] [64]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022

[65] [65]

Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

work page 2023

[66] [66]

Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

work page 2024

[67] [67]

The diffusion duality.arXiv preprint arXiv:2506.10892, 2025

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and V olodymyr Kuleshov. The diffusion duality.arXiv preprint arXiv:2506.10892, 2025

work page arXiv 2025

[68]

Simple guidance mechanisms for discrete diffusion models

Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Alexander Rush, Volodymyr Kuleshov, Hugo Dalla-Torre, Sam Boshar, Bernardo P de Almeida, and Thomas Pierrot. Simple guidance mechanisms for discrete diffusion models. In... International Conference on Learning Representations, volume 2025, page 44153, 2025

[69]

Scaling beyond masked diffusion language models

Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. Scaling beyond masked diffusion language models. arXiv e-prints, pages arXiv–2602, 2026

[70]

Simplified and generalized masked diffusion for discrete data

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in neural information processing systems, 37:103131–103167, 2024

[71]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

[72]

Sparse-dllm: Accelerating diffusion llms with dynamic cache eviction

Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, and Xipeng Qiu. Sparse-dllm: Accelerating diffusion llms with dynamic cache eviction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33038–33046, 2026

[73]

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193, 2025

[74]

wd1: Weighted policy optimization for reasoning in diffusion language models

Xiaohang Tang, Rares Dolga, Sangwoong Yoon, and Ilija Bogunovic. wd1: Weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838, 2025

[75]

From next-token to next-block: A principled adaptation path for diffusion llms

Yuchuan Tian, Yuchen Liang, Shuo Zhang, Yingte Shu, Guangwen Yang, Wei He, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, et al. From next-token to next-block: A principled adaptation path for diffusion llms. arXiv preprint arXiv:2512.06776, 2025

[76]

Generalized interpolating discrete diffusion

Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion. arXiv preprint arXiv:2503.04482, 2025

[77]

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, and Weiyao Lin. Creditdecoding: Accelerating parallel decoding in diffusion large language models with trace credits. arXiv preprint arXiv:2510.06133, 2025

[78]

Diffusion llms can do faster-than-ar inference via discrete diffusion forcing

Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, and Zhijie Deng. Diffusion llms can do faster-than-ar inference via discrete diffusion forcing. arXiv preprint arXiv:2508.09192, 2025

[79]

Sparsed: Sparse attention for diffusion language models

Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, and Xinchao Wang. Sparsed: Sparse attention for diffusion language models. In The Fourteenth International Conference on Learning Representations, 2026

[80]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022