TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.
hub
DKV-Cache: The cache for diffusion language models
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.
Theoretical analysis reveals MaskGIT's implicit temperature sampling in masked diffusion; proposes equivalent moment sampler and efficiency techniques for adaptive unmasking with image and text experiments.
PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.
Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.
CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.
Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.
TIDE schedules I/O-aware expert offloading for MoE diffusion LLMs by solving for an optimal refresh interval that exploits temporal stability of activations, yielding up to 1.5x throughput gain losslessly.
citing papers explorer
-
Drifting Objectives for Refining Discrete Diffusion Language Models
TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.
-
Muninn: Your Trajectory Diffusion Model But Faster
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
-
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
-
DARE: Diffusion Language Model Activation Reuse for Efficient Inference
DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
-
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
-
A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs
Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.
-
Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion
Theoretical analysis reveals MaskGIT's implicit temperature sampling in masked diffusion; proposes equivalent moment sampler and efficiency techniques for adaptive unmasking with image and text experiments.
-
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
-
Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs
Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.
-
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.
-
Consistent Diffusion Language Models
CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
-
Stability-Weighted Decoding for Diffusion Language Models
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
-
DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models
DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.
-
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
-
Diffusion Language Models Know the Answer Before Decoding
DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
-
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.
-
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE schedules I/O-aware expert offloading for MoE diffusion LLMs by solving for an optimal refresh interval that exploits temporal stability of activations, yielding up to 1.5x throughput gain losslessly.