Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

· 2025 · cs.LG · arXiv 2512.07173

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 1.1-2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.

citing papers explorer

Showing 2 of 2 citing papers.

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM cs.CL · 2026-05-10 · unverdicted · none · ref 47 · internal anchor
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction cs.CL · 2026-04-21 · unverdicted · none · ref 14 · internal anchor
R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.

Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer