dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Chang Zou; Junjie Chen; Linfeng Zhang; Qingyuan Wei; Shaobo Wang; Yaojie Zhang; Yichen Zhu; Yicun Yang; Zhiyuan Liu

arxiv: 2506.06295 · v2 · pith:3RK74S3Fnew · submitted 2025-05-17 · 💻 cs.LG · cs.AI · cs.CL

classification 💻 cs.LG cs.AI cs.CL

keywords caching, dllm-cache, dllms, models, inference, language, large, adaptive

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x FLOPs reduction on LongBench-HotpotQA while maintaining competitive output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. The code for this work is publicly available at: https://github.com/maomaocun/dLLM-cache.
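The caching idea in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the names `prompt_interval`, `response_interval`, and `update_ratio`, the use of cosine similarity as the feature-similarity measure, and the rule "recompute the least-similar fraction of response tokens" are all hypothetical choices made here to show how long-interval prompt caching and similarity-guided partial response updates might fit together.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tokens_to_update(cached_feats, new_feats, update_ratio):
    """Pick the response tokens whose features drifted most from the cache.

    Assumption: the fraction `update_ratio` of tokens with the lowest
    similarity to their cached features is recomputed; the rest reuse the cache.
    """
    n = len(new_feats)
    if n == 0:
        return []
    k = max(1, math.ceil(update_ratio * n))
    sims = [cosine_similarity(c, f) for c, f in zip(cached_feats, new_feats)]
    order = sorted(range(n), key=lambda i: sims[i])  # least similar first
    return sorted(order[:k])


def plan_step(step, prompt_interval, response_interval):
    """Decide which caches get a full refresh at a given denoising step.

    Assumption: the static prompt is refreshed at a long interval, the
    response at a shorter one; other steps use partial updates only.
    """
    return {
        "refresh_prompt": step % prompt_interval == 0,
        "refresh_response": step % response_interval == 0,
    }


# Toy features for four response tokens; only token 1 changes substantially.
cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
new = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.9, 0.1]]
print(select_tokens_to_update(cached, new, update_ratio=0.25))  # prints [1]
```

With these toy values, only the token whose features changed direction is recomputed, while the three stable tokens would reuse their cached results. A larger `update_ratio` trades more computation for fresher features.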

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Drifting Objectives for Refining Discrete Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
cs.RO 2026-05 unverdicted novelty 7.0

BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
cs.CL 2026-05 unverdicted novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
cs.CL 2026-04 unverdicted novelty 7.0

R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
cs.LG 2026-04 unverdicted novelty 7.0

DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
cs.LG 2026-04 unverdicted novelty 7.0

ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.

DMax: Aggressive Parallel Decoding for dLLMs
cs.LG 2026-04 conditional novelty 7.0

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs
cs.CL 2026-03 unverdicted novelty 7.0

Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.

Locally Coherent Parallel Decoding in Diffusion Language Models
cs.CL 2026-03 unverdicted novelty 7.0

CoDiLA adds a compact auxiliary AR model on diffusion latents to enforce local sequential validity during parallel token sampling in discrete diffusion language models.

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 6.0

PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration
cs.CL 2026-05 unverdicted novelty 6.0

FlexDraft is a lossless speculative decoding framework that adapts to batch sizes via attention tuning on final layers, MLP-based bonus calibration, and dynamic parallel/sequential decoding.

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs
cs.LG 2026-05 unverdicted novelty 6.0

Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.

Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers
cs.CL 2026-05 unverdicted novelty 6.0

Diffusion LLMs can act as their own efficiency teachers by using revokable parallel decoding to identify reliable token orders and then distilling those orders into the model parameters for faster inference.

Consistent Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 6.0

CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
cs.CL 2025-12 unverdicted novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.

Diffusion Language Models Know the Answer Before Decoding
cs.CL 2025-08 conditional novelty 6.0

DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
cs.LG 2026-04 unverdicted novelty 5.0

ECHO introduces one-step block diffusion via Direct Conditional Distillation and Response-Asymmetric Diffusion to generate chest X-ray reports faster than autoregressive models while improving clinical metrics.

DMax: Aggressive Parallel Decoding for dLLMs
cs.LG 2026-04 unverdicted novelty 5.0

DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...