
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing. arXiv preprint arXiv:2508.09192

Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing · 2025 · cs.LG · arXiv 2508.09192

15 Pith papers cite this work. Polarity classification is still indexing.

abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, no existing open-source dLLM has achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, for inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
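The abstract's first capability, block-wise autoregressive generation, can be pictured as an attention mask. The sketch below is illustrative only and is not taken from the D2F code base: it assumes that tokens attend freely within their own block and causally to all earlier blocks, which is what would let the keys and values of earlier blocks be cached. The names `block_causal_mask` and `render` are hypothetical.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumption (not confirmed by the abstract): attention is unrestricted inside
    a block and causal across blocks, so a position never sees a later block.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask as text: 'x' means attention allowed, '.' means blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    # Six positions split into three blocks of two tokens each.
    print(render(block_causal_mask(6, 2)))
```

With six positions and a block size of two, the first two rows allow only the first block, the middle two rows allow the first two blocks, and the last two rows allow everything. Because no row depends on a later block, the entries for earlier blocks stay fixed as later blocks are decoded, which is the property a KV cache relies on.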

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

cs.CL · 2026-02-09 · unverdicted · novelty 7.0

TEAM accelerates MoE dLLMs up to 2.2x by exploiting temporal-spatial consistency in expert routing to accept more tokens with fewer activations.

Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding

cs.CL · 2026-07-07 · accept · novelty 6.0

Joint AR–diffusion training yields one tri-mode LM that switches AR, diffusion, and self-speculation, beating open AR/diffusion models on accuracy and tokens-per-forward.

Multi-Block Diffusion Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 6.0 · 2 refs

MBD-LMs raise average tokens per forward pass from 3.47 to 6.19 (and to 9.34 with DMax) via multi-block teacher forcing and optimized parallel decoding while holding or slightly improving accuracy on math and code tasks.

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

K-Forcing introduces progressive self-forcing distillation to train a conditional push-forward model that jointly decodes k future tokens per forward pass, yielding 2.4-3.5x speedup at k=4 with modest quality loss on LM1B and OpenWebText.

Global Sketch-Based Watermarking for Diffusion Language Models

cs.CR · 2026-06-03 · unverdicted · novelty 6.0

Introduces a sketch-based watermarking method for masked diffusion language models providing an order-agnostic detection statistic decoupled from local context.

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

cs.CV · 2026-05-14 · unverdicted · novelty 6.0 · 2 refs

Diagnoses mask prior drift and positional attention collapse in LDVLMs and introduces two plug-and-play decoding interventions that raise long-form generation quality without retraining.

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

cs.CL · 2026-04-26 · unverdicted · novelty 6.0

Query position is a first-order variable in dLLM ICL whose variance matches semantic quality impact; mitigated via Average Confidence metric and training-free Auto-ICL routing.

STDec: Spatio-Temporal Stability Guided Decoding for dLLMs

cs.CL · 2026-04-07 · unverdicted · novelty 6.0

STDec raises dLLM decoding speed by up to 14x on benchmarks like MBPP by using observed spatio-temporal stability to create dynamic, token-specific confidence thresholds while preserving task performance.

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

cs.CL · 2025-12-16 · unverdicted · novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

cs.LG · 2025-12-10 · conditional · novelty 6.0

LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

cs.LG · 2026-04-10 · unverdicted · novelty 5.0 · 2 refs

ECHO introduces one-step block diffusion via Direct Conditional Distillation and Response-Asymmetric Diffusion to generate chest X-ray reports faster than autoregressive models while improving clinical metrics.

citing papers explorer

Showing 15 of 15 citing papers.

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM · cs.CL · 2026-05-10 · unverdicted · none · ref 15
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection · cs.LG · 2026-05-09 · unverdicted · none · ref 14
DMax: Aggressive Parallel Decoding for dLLMs · cs.LG · 2026-04-09 · conditional · none · ref 78 · 2 links
TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration · cs.CL · 2026-02-09 · unverdicted · none · ref 23
Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding · cs.CL · 2026-07-07 · accept · none · ref 66 · internal anchor
Multi-Block Diffusion Language Models · cs.LG · 2026-06-28 · unverdicted · none · ref 43 · 2 links
K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling · cs.LG · 2026-06-09 · unverdicted · none · ref 55
Global Sketch-Based Watermarking for Diffusion Language Models · cs.CR · 2026-06-03 · unverdicted · none · ref 54
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models · cs.CL · 2026-05-20 · unverdicted · none · ref 23
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models · cs.CV · 2026-05-14 · unverdicted · none · ref 20 · 2 links
Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics · cs.CL · 2026-04-26 · unverdicted · none · ref 27
STDec: Spatio-Temporal Stability Guided Decoding for dLLMs · cs.CL · 2026-04-07 · unverdicted · none · ref 5
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed · cs.CL · 2025-12-16 · unverdicted · none · ref 46
LLaDA2.0: Scaling Up Diffusion Language Models to 100B · cs.LG · 2025-12-10 · conditional · none · ref 35
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion · cs.LG · 2026-04-10 · unverdicted · none · ref 47 · 2 links
