SAFE-DiT: Semantics-Aware Fast-path Execution for High-Resolution Diffusion Transformers

Chuanzhi Xu; Weidong Cai; Xuanhua Yin; Yuxuan Jia

arxiv: 2606.29360 · v1 · pith:PZQETG2Tnew · submitted 2026-06-28 · 💻 cs.CV

Pith reviewed 2026-06-30 07:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords: diffusion transformer, high-resolution inference, attention mask, fast-path execution, token partitioning, memory reduction

The pith

SAFE-DiT accelerates high-resolution Diffusion Transformer inference by eliding image self-attention masks that induce only row-wise constant shifts in attention logits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-resolution DiT models contain spatial redundancy that attention masks attempt to exploit, yet many masks push scaled dot-product attention off fast execution paths and into slower dispatch. The paper shows that provenance-certified image self-attention masks add nothing more than a uniform shift to each row of logits, which softmax ignores, so they can be removed exactly, while text-padding masks that carry semantic information are retained. Spatial adaptation is then achieved through prompt-conditioned token partitioning, selective state updates that incorporate global context, and periodic context refresh. On Lumina-Next this yields a 2.69× end-to-end speedup at 1024² resolution and 5.09× at 2560², with peak memory at 2560² dropping from 94.1 GB to 27.9 GB. A reader would care because the changes enable generation at resolutions that otherwise exhaust memory or run impractically slowly, all without retraining or altering the model weights.
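The exactness claim rests on a standard property of softmax: adding the same constant to every logit in a row leaves the normalized weights unchanged. A minimal sketch in plain Python follows. The logits and shift values are made up for illustration and do not come from the paper, and the helper names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(logits, additive_mask=None):
    """Row-wise softmax of logits, optionally after adding a mask."""
    if additive_mask is None:
        return [softmax(row) for row in logits]
    return [
        softmax([z + b for z, b in zip(z_row, b_row)])
        for z_row, b_row in zip(logits, additive_mask)
    ]

# Illustrative values only.
logits = [[0.5, -1.2, 2.0], [1.0, 1.0, -0.3]]
# Each row of this mask repeats a single constant (a row-wise shift).
row_constant_mask = [[-3.0, -3.0, -3.0], [7.5, 7.5, 7.5]]

masked = attention_weights(logits, row_constant_mask)
unmasked = attention_weights(logits)
# Largest element-wise difference between the two sets of weights.
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(masked, unmasked)
    for a, b in zip(row_a, row_b)
)
```

Here `max_gap` stays at floating-point noise level, which is why such a mask can be dropped without changing the attention output.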

Core claim

Provenance-certified image self-attention masks that induce only a row-wise constant shift in attention logits can be removed without semantic loss, text-padding masks are preserved, and spatial adaptation is realized via prompt-conditioned token partitioning, selective state updates with global context, and periodic context refresh, delivering up to 5.09 times acceleration and enabling 3072 squared generation.

What carries the argument

Exact elision of constant-shift image self-attention masks together with prompt-conditioned token partitioning and selective state updates.
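The scheduling half of the argument can be pictured with a small sketch. It assumes a per-token sensitivity score is already available and that anchor steps recur at a fixed interval. The function names, the `sensitive_fraction` and `refresh_every` parameters, and the scores are all hypothetical and are not taken from the paper or its code.

```python
def partition_tokens(sensitivity, sensitive_fraction=0.25):
    """Split token indices into (sensitive, context) by descending score."""
    n_sensitive = max(1, round(len(sensitivity) * sensitive_fraction))
    order = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i], reverse=True)
    sensitive = sorted(order[:n_sensitive])
    context = sorted(order[n_sensitive:])
    return sensitive, context

def tokens_to_update(step, sensitive, context, refresh_every=4):
    """Anchor steps refresh every token; other steps update only sensitive ones."""
    if step % refresh_every == 0:
        return sorted(sensitive + context)
    return list(sensitive)

# Hypothetical per-token sensitivity scores for an 8-token latent.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.15]
sensitive, context = partition_tokens(scores, sensitive_fraction=0.25)
# Number of tokens recomputed at each of 8 denoising steps.
plan = [len(tokens_to_update(s, sensitive, context)) for s in range(8)]
```

The sketch only covers which tokens are recomputed at each step. It says nothing about how sensitive tokens attend to cached context, which is where the approximation enters.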

If this is right

End-to-end acceleration reaches 2.69 times at 1024 squared resolution on Lumina-Next.
Acceleration reaches 5.09 times at 2560 squared resolution while peak memory falls from 94.1 GB to 27.9 GB.
Generation at 3072 squared becomes feasible when dense inference exceeds available memory.
Metrics, component ablations, and a blinded human study indicate visual non-inferiority to the dense fast-path baseline.
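As a quick sanity check on the reported figures, the arithmetic below turns them into ratios. Only the four input numbers come from the paper. The derived quantities are plain division and the variable names are illustrative.

```python
dense_gb = 94.1   # reported peak memory, dense baseline at 2560^2
safe_gb = 27.9    # reported peak memory, SAFE-DiT at 2560^2

memory_ratio = dense_gb / safe_gb
memory_saved_pct = 100.0 * (1.0 - safe_gb / dense_gb)

# Reported end-to-end speedups on Lumina-Next.
speedup_1024 = 2.69
speedup_2560 = 5.09
# Fraction of the dense wall-clock time that remains after acceleration.
time_fraction_1024 = 1.0 / speedup_1024
time_fraction_2560 = 1.0 / speedup_2560

print(f"memory ratio: {memory_ratio:.2f}x, saved: {memory_saved_pct:.1f}%")
print(f"remaining time: {time_fraction_1024:.1%} at 1024^2, {time_fraction_2560:.1%} at 2560^2")
```

The memory figures correspond to roughly a 3.4× reduction, or about 70% of peak memory saved at 2560².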

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mask-elision test could be applied to other DiT architectures to identify which attention masks are safe to drop.
The selective state-update schedule might be made prompt-dependent to further reduce compute on simple scenes.
Periodic context refresh rate offers a tunable knob between speed and long-range coherence that could be optimized per model size.

Load-bearing premise

Image self-attention masks that add only a constant shift to each row of attention logits can be dropped without changing the semantic content of the generated images.
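The paper certifies masks by provenance, that is, by how they were constructed. As a swapped-in illustration of the same premise, the sketch below instead checks an additive mask numerically: a mask qualifies only if every row repeats one finite value. The function name and the example masks are hypothetical.

```python
import math

def is_row_constant_shift(additive_mask, tol=0.0):
    """Return True if every row holds one repeated finite value.

    Such a mask adds the same constant to each logit in a row, so the
    softmax over that row is unchanged and the mask can be dropped.
    """
    for row in additive_mask:
        if not row:
            return False
        first = row[0]
        if not math.isfinite(first):
            return False  # an all -inf row is not a finite shift; keep the mask
        if any(not math.isfinite(v) or abs(v - first) > tol for v in row):
            return False
    return True

NEG_INF = float("-inf")
# Image self-attention mask that hides nothing.
all_visible = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Text-padding mask: the last key is padding and must stay hidden.
text_padding = [[0.0, 0.0, NEG_INF], [0.0, 0.0, NEG_INF]]

elide_image_mask = is_row_constant_shift(all_visible)
elide_text_mask = is_row_constant_shift(text_padding)
```

The all-visible mask passes and the padding mask does not, which mirrors the split between masks that are removed and masks that are kept.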

What would settle it

Generate paired images from the same prompt and seed with the masks present versus removed, then test whether a standard perceptual similarity score drops below the paper's non-inferiority threshold or whether blinded human raters detect a difference.
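One way to frame such a test is a paired non-inferiority check on per-prompt scores. The sketch below uses a one-sided normal approximation. The scores, the margin of 0.02, and the function name are hypothetical, and the paper's actual threshold and metric are not reproduced here.

```python
import math
import statistics

def non_inferior(dense_scores, fast_scores, margin, z=1.645):
    """One-sided paired non-inferiority test using a normal approximation.

    Returns (lower_bound, passed): the lower confidence bound on the mean
    paired difference (fast - dense) and whether it clears -margin.
    """
    diffs = [f - d for d, f in zip(dense_scores, fast_scores)]
    mean = statistics.mean(diffs)
    stderr = statistics.stdev(diffs) / math.sqrt(len(diffs))
    lower = mean - z * stderr
    return lower, lower > -margin

# Hypothetical per-prompt quality scores (higher is better), same prompt and seed.
dense = [0.80, 0.75, 0.90, 0.85, 0.70, 0.88]
fast = [0.79, 0.76, 0.89, 0.85, 0.69, 0.88]
lower, passed = non_inferior(dense, fast, margin=0.02)
```

With only a handful of pairs, a t-distribution quantile would be more appropriate than `z=1.645`. The normal approximation is used here only to keep the sketch dependency-free.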

Figures

Figures reproduced from arXiv: 2606.29360 by Chuanzhi Xu, Weidong Cai, Xuanhua Yin, Yuxuan Jia.

**Figure 1.** SAFE-DiT separates exact mask elision from mask-free spatial scheduling. Certified redundant image self-attention masks are removed to avoid the Mask-Induced Dispatch Tax, semantics-bearing text masks are retained, and non-anchor computation is expressed through sensitive-query updates and context reuse.

**Figure 2.** SAFE-DiT pipeline. A short warm-up aggregates image-to-text attention into a sensitivity map and fixes sensitive/context tokens. Anchor steps refresh all tokens; non-anchor steps update sensitive queries, reuse context keys/values, elide certified redundant self-attention masks, and keep text masks. SW-CFG applies sensitivity-weighted per-token guidance.

**Figure 3.** High-resolution speed–memory frontier. SAFE-DiT’s advantage increases with resolution on Lumina-Next and enables …

**Figure 4.** Paired 2048² examples with matched prompt/seed and PixArt-Σ-style generation. SAFE-DiT improves local detail while keeping composition close to Dense.

**Figure 5.** Mask-Induced Dispatch Tax. A redundant mask makes …

**Figure 6.** Speed and quality Pareto on Lumina-Next (DrawBench).

**Figure 7.** Cross-backbone speedup. The masked backbone …

**Figure 8.** Module roles. Left: removing a speed module (fast …

**Figure 9.** Context-state drift over denoising steps. Without CAR, …

**Figure 10.** PCSP sensitivity. (a) The partition’s center of mass …

**Figure 11.** Blind VLM win/tie/loss for SAFE-DiT vs. Dense. Both …

**Figure 12.** Paired qualitative comparison (matched prompt/seed). Columns: Dense, SAFE-DiT, DPCache, TeaCache, ClusCa, FastCache.

Original abstract

High-resolution Diffusion Transformer (DiT) inference contains substantial spatial redundancy, but many spatially adaptive implementations encode regional computation as attention masks, which can inadvertently move scaled dot-product attention (SDPA) away from FlashAttention fast paths. We identify this avoidable systems bottleneck as Mask-Induced Dispatch Tax (MIDT) and show that it grows with latent sequence length. We introduce SAFE-DiT, a training-free Semantics-Aware Fast-path Execution framework that separates exact mask elision from approximation-based spatial scheduling. SAFE-DiT removes only provenance-certified image self-attention masks that induce a row-wise constant shift in attention logits, preserves semantics-bearing masks such as text-padding masks, and realizes spatial adaptation through prompt-conditioned token partitioning, selective state updates with global context, and periodic context refresh. We call this acceleration-only configuration SAFE-Core and report sensitivity-weighted classifier-free guidance separately as SAFE-DiT+SW. On the evaluated PyTorch SDPA stack, redundant masks make long-sequence attention $4.1\times$ to $5.8\times$ slower than the mask-free path. On Lumina-Next, SAFE-DiT achieves $2.69\times$ end-to-end acceleration at $1024^2$ resolution and $5.09\times$ at $2560^2$, reduces peak memory at $2560^2$ from 94.1 to 27.9 GB, and enables $3072^2$ generation when dense inference runs out of memory. Paired metrics, component ablations, and a blinded human study support visual non-inferiority of SAFE-Core to the dense fast-path baseline, while SAFE-DiT+SW provides a separate prompt-alignment operating point without reintroducing spatial self-attention masks. Code is available at https://github.com/xuanhuayin/SAFE-DiT.
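To see why a cost that grows with latent sequence length matters at these resolutions, the sketch below counts image tokens per resolution. It assumes an 8× VAE downsampling factor and a 2×2 patch size. Both are assumptions made for illustration and are not stated in the abstract.

```python
def image_tokens(resolution, vae_factor=8, patch_size=2):
    """Number of image tokens for a square image under the assumed tokenizer."""
    side = resolution // (vae_factor * patch_size)
    return side * side

resolutions = [1024, 2560, 3072]
tokens = {r: image_tokens(r) for r in resolutions}
# Entries in one head's self-attention score matrix grow with tokens squared.
score_entries = {r: n * n for r, n in tokens.items()}
growth_vs_1024 = {r: score_entries[r] / score_entries[1024] for r in resolutions}
```

Under these assumptions the sequence grows from 4,096 tokens at 1024² to 25,600 at 2560², so a dense per-head score matrix, and any mask materialized at that shape, has about 39 times as many entries.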

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAFE-DiT is a practical systems tweak that cuts mask overhead in high-res DiT attention for real speed and memory gains, but the approximation in scheduling lacks formal checks on quality drift.

SAFE-DiT targets the overhead from attention masks in Diffusion Transformers that push computation off FlashAttention paths. The authors name this overhead the Mask-Induced Dispatch Tax (MIDT) and split the fix in two: exact mask removal for cases where logits shift by a row-wise constant (which softmax ignores), and approximation for spatial adaptation via prompt-conditioned partitioning, selective updates, and periodic refresh.

The paper does the engineering cleanly. On Lumina-Next it reports 2.69× end-to-end speedup at 1024² and 5.09× at 2560², drops peak memory from 94.1 GB to 27.9 GB at 2560², and runs 3072² when the dense version runs out of memory. The authors keep text-padding masks, release code, run component ablations, and include a blinded human study claiming visual non-inferiority for the core version.

The soft spot is exactly where the stress-test note points: the scheduling is approximation-based, and nothing in the abstract or reported results gives a bound or worst-case test showing periodic global refresh prevents semantic drift on arbitrary prompts at long sequences. The human study is cited but its scale and variance are not detailed here, so the non-inferiority claim stays empirical.

This is for inference engineers working on DiT deployments who need lower hardware costs at high resolution. The concrete numbers and public code make it worth a referee's time even if revisions would need tighter validation on the approximation.

Referee Report

2 major / 3 minor

Summary. The paper introduces SAFE-DiT, a training-free framework for high-resolution Diffusion Transformer inference that identifies Mask-Induced Dispatch Tax (MIDT) arising from attention masks that divert SDPA from FlashAttention fast paths. It separates exact elision of provenance-certified image self-attention masks (which induce only row-wise constant logit shifts) from approximation-based spatial scheduling via prompt-conditioned token partitioning, selective state updates with global context, and periodic refresh. On Lumina-Next it reports 2.69× end-to-end speedup at 1024² and 5.09× at 2560², memory reduction from 94.1 GB to 27.9 GB at 2560², and the ability to run 3072² when dense inference OOMs; ablations, paired metrics, and a blinded human study are cited to support visual non-inferiority of the SAFE-Core configuration.

Significance. If the empirical quality claims hold, the work offers a practical, systems-level improvement for DiT inference that preserves the FlashAttention fast path while enabling higher resolutions on constrained hardware. The explicit separation of exact mask elision from approximation-based scheduling, together with public code, strengthens reproducibility and potential adoption.

major comments (2)

[§4] §4 (spatial scheduling): the selective state updates with periodic global-context refresh constitute an approximation whose semantic fidelity is asserted via ablations and human study, yet no worst-case bound or drift analysis is supplied for arbitrary prompts in long-sequence, high-resolution regimes; this is load-bearing for the non-inferiority claim.
[Human Study] Human-study section: the blinded study is invoked to support non-inferiority, but the manuscript provides no details on participant count, prompt selection, rating protocol, or statistical power, preventing assessment of whether the evidence is sufficient to underwrite the central quality-preservation claim.

minor comments (3)

[§3] The abstract states that redundant masks make long-sequence attention 4.1×–5.8× slower; the corresponding measurement protocol and hardware stack should be stated explicitly in §3 or the experimental section.
[§3] Notation for the row-wise constant logit shift induced by elided masks should be formalized with an equation to make the invariance argument fully precise.
The distinction between SAFE-Core and SAFE-DiT+SW should be summarized in a single table or paragraph early in the paper for reader orientation.
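The invariance that the second minor comment asks to have formalized can be written in one line. The notation here is illustrative and not taken from the paper: $z_i$ is row $i$ of the attention logits, $c_i$ is the finite constant that the mask adds to every entry of that row, and $\mathbf{1}$ is the all-ones vector.

```latex
\[
\operatorname{softmax}(z_i + c_i \mathbf{1})_j
= \frac{e^{z_{ij} + c_i}}{\sum_{k} e^{z_{ik} + c_i}}
= \frac{e^{c_i}\, e^{z_{ij}}}{e^{c_i} \sum_{k} e^{z_{ik}}}
= \operatorname{softmax}(z_i)_j ,
\qquad c_i \in \mathbb{R}.
\]
```

The factor $e^{c_i}$ cancels only when $c_i$ is finite and identical across the row, which is why masks that hide individual keys, such as text-padding masks, fall outside this argument.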

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the detailed comments. We address each major point below.

Referee: [§4] §4 (spatial scheduling): the selective state updates with periodic global-context refresh constitute an approximation whose semantic fidelity is asserted via ablations and human study, yet no worst-case bound or drift analysis is supplied for arbitrary prompts in long-sequence, high-resolution regimes; this is load-bearing for the non-inferiority claim.

Authors: We acknowledge that selective state updates with periodic refresh form an approximation. The design intentionally uses prompt-conditioned partitioning and periodic full-context refresh to limit semantic drift, and the manuscript already includes ablations that vary the refresh interval and measure impact on FID, CLIP score, and human preference. Deriving a general worst-case bound on drift for arbitrary prompts is difficult because diffusion trajectories depend on the stochastic sampler, the specific prompt semantics, and the interaction between text and image tokens; such analysis would require assumptions that do not hold across the broad prompt distribution used in practice. We therefore rely on the empirical evidence already presented and do not plan to add a theoretical bound.

Revision: no

Referee: [Human Study] Human-study section: the blinded study is invoked to support non-inferiority, but the manuscript provides no details on participant count, prompt selection, rating protocol, or statistical power, preventing assessment of whether the evidence is sufficient to underwrite the central quality-preservation claim.

Authors: We agree that the human-study protocol details were omitted. In the revised manuscript we will add the number of participants, the prompt-selection procedure, the exact rating interface and scale, the blinding procedure, and the statistical tests (including power analysis) used to support the non-inferiority conclusion.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems technique with measured speedups and ablations, not derived predictions.

The paper presents SAFE-DiT as a training-free implementation framework separating exact mask elision (proven invariant under row-wise logit shift) from approximation-based scheduling. Reported accelerations (2.69× at 1024², 5.09× at 2560²) and memory reductions are direct runtime measurements on PyTorch SDPA, not outputs of any fitted model or self-referential equation. Quality claims rest on component ablations and a blinded human study rather than a closed derivation. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided text. The work is self-contained as an engineering optimization whose correctness is externally verifiable by reproduction on the released code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the listed mask categories and scheduling rules are sufficient to preserve semantics.

Reference graph

Works this paper leans on

62 extracted references · 10 canonical work pages · 6 internal anchors

[1]

Blended diffusion for text-driven editing of natural images

Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18208–18218, 2022. 3

2022
[2]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Ait- tala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Kar- ras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion mod- els with an ensemble of expert denoisers.arXiv preprint arXiv:2211.01324, 2022. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

All are worth words: A ViT back- bone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT back- bone for diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22669–22679, 2023. 3

2023
[4]

Multidiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. InInternational Conference on Machine Learn- ing (ICML), 2023. 1, 3

2023
[5]

FLUX.1 models, 2024

Black Forest Labs. FLUX.1 models, 2024. 6

2024
[6]

Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4), 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4), 2023. 1, 3, 5

2023
[7]

Pixart-σ: Weak-to-strong training of dif- fusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of dif- fusion transformer for 4k text-to-image generation. InEuro- pean Conference on Computer Vision (ECCV), 2024. 1, 3, 6

2024
[8]

Pixart-α: Fast training of diffusion trans- former for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion trans- former for photorealistic text-to-image synthesis. InInter- national Conference on Learning Representations (ICLR),
[9]

Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen.∆-DiT: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024. 3, 7

work page arXiv 2024
[10]

Denoising as path planning: Training-free acceler- ation of diffusion models with DPCache

Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, and Pipei Huang. Denoising as path planning: Training-free acceler- ation of diffusion models with DPCache. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. 3, 7

2026
[11]

FlashAttention-2: Faster attention with better par- allelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better par- allelism and work partitioning. InInternational Conference on Learning Representations (ICLR), 2024. 2, 3, 6

2024
[12]

Fu, Stefano Ermon, Atri Rudra, and Christopher R´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), pages 16344– 16359, 2022. 2, 3, 6

2022
[13]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2021. 1, 3

2021
[14]

FlexAttention: A programming model for generating fused attention variants

Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. FlexAttention: A programming model for generating fused attention variants. InProceedings of Ma- chine Learning and Systems (MLSys), 2025. 2, 3

2025
[15]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representa- tions (ICLR), 2021. 1, 3

2021
[16]

Scaling rec- tified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rec- tified flow transformers for high-resolution image synthesis. InProceedings of the 41st International Conference on Ma- chine Learning...

2024
[17]

Lumina-T2X: Scalable flow-based large diffusion transformer for flexible resolution generation

Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xie, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, Tong He, Jingwen He, Junjun He, Yu Qiao, and Hongsheng Li. Lumina-T2X: Scalable flow-based large diffusion transformer for flexible resolution generation. InInternational...

2025
[18]

GenEval: An object-focused framework for evaluating text- 9 to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text- 9 to-image alignment. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2023. 6

2023
[19]

Prompt-to-prompt image editing with cross-attention control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. InInternational Confer- ence on Learning Representations (ICLR), 2023. 1, 3, 5

2023
[20]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Jain, and Pieter Abbeel

Jonathan Ho, Ajay N. Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Infor- mation Processing Systems, pages 6840–6851, 2020. 1, 3

2020
[22]

T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xi- hui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. InAd- vances in Neural Information Processing Systems (NeurIPS),
[23]

GLIGEN: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 22511–22521,
[24]

Snap- Fusion: Text-to-image diffusion model on mobile devices within two seconds

Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Snap- Fusion: Text-to-image diffusion model on mobile devices within two seconds. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2023. 3

2023
[25]

SDiT: Semantic region-adaptive for diffusion transformers.arXiv preprint arXiv:2601.12283, 2026

Bowen Lin, Fanjiang Ye, Yihua Liu, Zhenghui Guo, Boyuan Zhang, Weijian Zheng, Yufan Xu, Tiancheng Xing, Yuke Wang, and Chengming Zhang. SDiT: Semantic region-adaptive for diffusion transformers.arXiv preprint arXiv:2601.12283, 2026. 3, 7

work page arXiv 2026
[26]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014. 6

2014
[27]

FastCache: Fast caching for diffusion transformer through learnable linear approxi- mation.arXiv preprint arXiv:2505.20353, 2025

Dong Liu, Yanxuan Yu, Jiayi Zhang, Yifan Li, Ben Lengerich, and Ying Nian Wu. FastCache: Fast caching for diffusion transformer through learnable linear approxi- mation.arXiv preprint arXiv:2505.20353, 2025. 3, 7

work page arXiv 2025
[28]

Timestep embedding tells: It’s time to cache for video diffusion model

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),
[29]

From reusing to forecasting: Accelerat- ing diffusion models with taylorseers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerat- ing diffusion models with taylorseers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15853–15863, 2025. 3, 7

2025
[30]

Region-adaptive sampling for diffusion transformers

Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, and Yuqing Yang. Region-adaptive sampling for diffusion transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. 3, 6, 7

2026
[31]

DPM-Solver: A fast ODE solver for diffu- sion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffu- sion probabilistic model sampling in around 10 steps. InAd- vances in Neural Information Processing Systems (NeurIPS),
[32]

DPM-Solver++: Fast solver for guided sam- pling of diffusion probabilistic models.Machine Intelligence Research, 22:730–751, 2025

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sam- pling of diffusion probabilistic models.Machine Intelligence Research, 22:730–751, 2025. 3

2025
[33]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

DeepCache: Accelerating diffusion models for free

Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating diffusion models for free. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15762–15772, 2024. 1, 3

2024
[35]

SDEdit: Guided image synthesis and editing with stochastic differential equa- tions

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equa- tions. InInternational Conference on Learning Representa- tions (ICLR), 2022. 3

2022
[36]

GLIDE: Towards photorealis- tic image generation and editing with text-guided diffusion models

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealis- tic image generation and editing with text-guided diffusion models. InProceedings of the 39th International Conference on Machine Learning (ICML), pages 16784–16804. PMLR,
[37]

Scalable diffusion mod- els with transformers

William Peebles and Saining Xie. Scalable diffusion mod- els with transformers. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 1, 3

2023
[38]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InInternational Con- ference on Learning Representations (ICLR), 2024. 1, 3

2024
[39]

PyTorch, 2026

PyTorch Contributors.PyTorch Documentation: Scaled Dot Product Attention. PyTorch, 2026. 2, 3, 6

2026
[40]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with CLIP latents.arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv
[41]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 1, 3

2022
[42]

Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Sali- mans, Jonathan Ho, David J

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Sali- mans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep lan- guage understanding. InAdvances in Neural Information Processing Systems (NeurIPS)...

2022
[43]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Confer- ence on Learning Representations (ICLR), 2022. 3

2022
[44]

Fast high- resolution image synthesis with latent adversarial diffusion distillation

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high- resolution image synthesis with latent adversarial diffusion distillation. InACM SIGGRAPH Asia Conference Papers (SIGGRAPH Asia), 2024. 3

2024
[45]

arXiv preprint arXiv:2407.01425 (2024)

Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, and Luming Liang. FORA: Fast-forward caching in diffusion transformer acceleration.arXiv preprint arXiv:2407.01425,

work page arXiv
[46]

FlashAttention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. InAd- vances in Neural Information Processing Systems (NeurIPS),
[47]

Rethinking the spatial inconsistency in classifier- free diffusion guidance

Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, and Yu Liu. Rethinking the spatial inconsistency in classifier- free diffusion guidance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9370–9379, 2024. 1, 3, 7

2024
[48]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InInternational Conference on Learning Representations (ICLR), 2021. 1, 3

2021
[49]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. InInternational Conference on Machine Learning (ICML), 2023. 1, 3

2023
[50]

What the DAAM: Interpreting stable diffusion using cross attention

Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. What the DAAM: Interpreting stable diffusion using cross attention. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 5644–5659, 2023.

2023
[51]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

2017
[52]

FlashMask: Efficient and rich mask extension of FlashAttention

Guoxia Wang, Jinle Zeng, Xiyuan Xiao, Siming Wu, Jiabin Yang, Lujing Zheng, Zeyu Chen, Jiang Bian, Dianhai Yu, and Haifeng Wang. FlashMask: Efficient and rich mask extension of FlashAttention. In International Conference on Learning Representations (ICLR), 2025.

2025
[53]

Cache me if you can: Accelerating diffusion models through block caching

Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Xiaoliang Dai, Ji Hou, Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler, Christian Rupprecht, Daniel Cremers, Peter Vajda, and Jialiang Wang. Cache me if you can: Accelerating diffusion models through block caching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Rec...
[54]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,
[55]

ImageReward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

2023
[56]

AccelAes: Accelerating Diffusion Transformers for Training-Free Aesthetic-Enhanced Image Generation

Xuanhua Yin, Chuanzhi Xu, Haoxian Zhou, Boyu Wei, and Weidong Cai. AccelAes: Accelerating diffusion transformers for training-free aesthetic-enhanced image generation. arXiv preprint arXiv:2603.12575, 2026.

2026
[57]

Scaling autoregressive models for content-rich text-to-image generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022.

2022
[58]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023.

2023
[59]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.

2018
[60]

Compute only 16 tokens in one timestep: Accelerating diffusion transformers with cluster-driven feature caching

Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, and Linfeng Zhang. Compute only 16 tokens in one timestep: Accelerating diffusion transformers with cluster-driven feature caching. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 10181–10189,
[61]

Accelerating diffusion transformers with token-wise feature caching

Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, and Linfeng Zhang. Accelerating diffusion transformers with token-wise feature caching. In International Conference on Learning Representations (ICLR), 2025.

2025

Appendix Overview

This appendix provides the full experimental record behind the main paper: the component-level relation to the closest spatia...

Text encoding, VAE decoding, and post-processing stay small and stable (∼0.03 / 0.13 / 0.02 s), and the scheduler/PCSP overhead is 0.16 to 0.26 s.

G. High-Resolution Frontier

Table 11 gives the full numeric frontier on Lumina-Next, and Table 12 adds the strongest acceleration baselines. Temporal caches skip steps but still materialize full attention ...
