Learning-to-cache: Accelerating diffusion transformer via layer caching
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 6
roles: background 1
polarities: background 1
citing papers explorer
- DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
  DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.
- Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
  Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
- CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
  CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.
- S2O: Early Stopping for Sparse Attention via Online Permutation
  S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.
- PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference
  PipeFusion applies patch partitioning and pipeline parallelism with one-step stale feature reuse to reduce communication overhead in DiT inference, reporting SOTA results on 8x L40 GPUs for Pixart, SD3, and Flux.1.
- AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers
  AdaCorrection adaptively corrects offset caches in DiT inference via on-the-fly spatio-temporal validity checks to maintain near-original FID with moderate acceleration.