pith. machine review for the scientific record.

arxiv: 2603.26357 · v2 · submitted 2026-03-27 · 💻 cs.CV

Recognition: 2 Lean theorem links

MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-patch transformer · diffusion transformer · flow matching · efficient generative models · hierarchical architecture · image synthesis · computational efficiency · ImageNet

The pith

Multi-patch global-to-local transformers cut the computational cost of diffusion models by up to half while preserving generative performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a hierarchical transformer for diffusion and flow-matching models that processes larger patches in early blocks to capture coarse global context and switches to smaller patches in later blocks to refine local details. This replaces the fixed patch size used throughout standard isotropic DiTs, which the authors argue drives unnecessary computation during training. Combined with redesigned time and class embeddings, the approach is shown to cut GFLOPs by up to 50 percent on ImageNet while preserving competitive generative quality. A reader would care because current DiT training remains expensive, and a simple change in token hierarchy offers a direct path to lower resource use without new hardware.
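
To make the token hierarchy concrete, here is a minimal sketch of a global-to-local forward pass in the spirit of the paper's Figure 2. It is an editorial illustration, not the authors' released architecture: the module names, patch sizes (4 then 2), depth split, and the nearest-neighbour upsample transition are all assumptions.

```python
# Minimal sketch of a global-to-local patch schedule on a latent of shape
# (B, C, H, W). Patch sizes, depths, and the coarse-to-fine transition are
# illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn


class Block(nn.Module):
    """Plain pre-norm transformer block (stand-in for a DiT block)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class GlobalToLocal(nn.Module):
    """Early blocks see few large-patch tokens; later blocks see many small-patch tokens."""
    def __init__(self, in_ch=4, dim=384, big=4, small=2, depth_global=6, depth_local=6):
        super().__init__()
        self.embed_big = nn.Conv2d(in_ch, dim, kernel_size=big, stride=big)   # coarse tokens
        self.global_blocks = nn.ModuleList(Block(dim) for _ in range(depth_global))
        # one possible transition: upsample coarse tokens to the finer grid, then project
        self.up = nn.Upsample(scale_factor=big // small, mode="nearest")
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.local_blocks = nn.ModuleList(Block(dim) for _ in range(depth_local))

    def forward(self, x):                       # x: (B, in_ch, H, W)
        b, _, h, w = x.shape
        t = self.embed_big(x)                   # (B, dim, H/big, W/big)
        gh, gw = t.shape[-2:]
        t = t.flatten(2).transpose(1, 2)        # (B, N_big, dim) with N_big = gh * gw
        for blk in self.global_blocks:
            t = blk(t)
        t = t.transpose(1, 2).reshape(b, -1, gh, gw)
        t = self.proj(self.up(t))               # densify to the small-patch grid
        t = t.flatten(2).transpose(1, 2)        # (B, N_small, dim), N_small = 4 * N_big here
        for blk in self.local_blocks:
            t = blk(t)
        return t


tokens = GlobalToLocal()(torch.randn(1, 4, 32, 32))
print(tokens.shape)                             # torch.Size([1, 256, 384])
```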

Core claim

The central claim is that a multi-patch global-to-local transformer architecture, in which early blocks operate on larger patches and later blocks operate on smaller patches, reduces computational cost by up to 50 percent in GFLOPs while achieving good generative performance on ImageNet for both diffusion and flow-matching models.

What carries the argument

The multi-patch global-to-local hierarchy that varies patch size across successive transformer blocks to capture global structure first and local detail later.
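
To see why the schedule can cut compute substantially, a back-of-envelope FLOP count per block (keeping only the dominant matmul terms) is sketched below. The width and depth are standard DiT-XL-style settings, but the global/local depth split and patch sizes are editorial assumptions, so how close the printed ratio lands to the claimed 50 percent depends entirely on that schedule.

```python
# Rough FLOP count for an isotropic DiT vs. a global-to-local schedule.
# Only dominant matmul terms are kept; the depth split and patch sizes of
# the global-to-local variant are illustrative assumptions, not the paper's.
def block_flops(n_tokens, dim):
    attn = 4 * n_tokens * dim**2 + 2 * n_tokens**2 * dim   # qkv/out projections + attention matmuls
    mlp = 8 * n_tokens * dim**2                             # two linear layers at 4x expansion
    return attn + mlp

def model_flops(schedule, dim=1152):
    # schedule: list of (num_blocks, tokens_per_block)
    return sum(depth * block_flops(n, dim) for depth, n in schedule)

latent = 32                                    # 32x32 latent grid (ImageNet 256 with an f=8 VAE)
iso = [(28, (latent // 2) ** 2)]               # isotropic: 28 blocks, patch size 2 -> 256 tokens
g2l = [(14, (latent // 4) ** 2),               # global half: patch size 4 -> 64 tokens
       (14, (latent // 2) ** 2)]               # local half: patch size 2 -> 256 tokens

print(f"isotropic       : {model_flops(iso) / 1e9:.1f} GFLOPs")
print(f"global-to-local : {model_flops(g2l) / 1e9:.1f} GFLOPs")
print(f"ratio           : {model_flops(g2l) / model_flops(iso):.2f}")
```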

If this is right

  • Diffusion and flow-matching models can be trained with substantially lower floating-point operations while retaining competitive image quality on ImageNet.
  • Redesigned time and class embeddings accelerate training convergence beyond the savings from patch hierarchy alone.
  • The same global-to-local patch progression applies directly to both diffusion and flow-matching training pipelines.
  • Generative performance holds without extra architectural compensations once the patch-size schedule is set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design may transfer naturally to video or 3D generation where global context precedes local refinement.
  • Combining the patch hierarchy with existing efficiency techniques such as pruning could produce additive savings.
  • Inference cost may also drop because early blocks already operate on fewer tokens, though the paper focuses on training.
  • Scaling the schedule to higher resolutions or larger models would test whether the 50 percent reduction holds proportionally.

Load-bearing premise

Switching patch sizes across blocks preserves the same generative quality and convergence behavior as a standard isotropic DiT without requiring additional compensatory changes.

What would settle it

A controlled comparison in which an isotropic DiT baseline with matched total compute or parameters achieves meaningfully better FID scores or faster convergence than MPDiT on ImageNet would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 2603.26357 by Dimitris Metaxas, Quan Dao.

Figure 1. Generated samples from MPDiT-XL with the cfg-scale.
Figure 2. Architecture of MPDiT: (a) the Global-Local Multi-Patch Diffusion Transformer, (b) the DiT block with shared time embedding, (c) the upsample module, and (d) the FNO time embedding.
Figure 3. Qualitative results on ImageNet 512 with cfg=4.
Figure 5. Qualitative images of class 33 "loggerhead, loggerhead".
Figure 6. Qualitative images of class 84 "peacock".
Figure 8. Qualitative images of class 88 "macaw".
Figure 10. Qualitative images of class 417 "balloon".
Figure 12. Qualitative images of class 980 "volcano".
read the original abstract

Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training process. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design could reduces computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we also propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at: https://github.com/quandao10/MPDiT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MPDiT, a multi-patch global-to-local transformer for diffusion and flow-matching models. Early blocks operate on larger patches to capture coarse global context while later blocks use smaller patches for local refinement; the design is claimed to reduce GFLOPs by up to 50% relative to isotropic DiT while preserving generative performance on ImageNet. Improved time- and class-embedding schemes are also introduced to accelerate convergence. Code is released.

Significance. If the efficiency claims are substantiated under matched training budgets and model sizes, the hierarchical patch-size schedule would constitute a practical advance for scaling transformer-based generative models. The public code release is a clear strength that enables direct verification and extension.

major comments (2)
  1. §3.2 (Multi-Patch Blocks): the transition operator between large-patch (coarse-token) and small-patch (dense-token) blocks is not specified. No equation or diagram describes the required token reshaping, interpolation, or projection, nor is its parameter or FLOP overhead included in the reported GFLOPs figures. This mechanism is load-bearing for the central 50% reduction claim.
  2. §4 (Experiments): the abstract asserts “good generative performance” and a 50% GFLOPs saving, yet the provided text supplies no quantitative FID, IS, or precision-recall numbers, no matched-budget DiT baselines, and no ablation isolating the effect of the patch-size schedule from the embedding changes. Without these controls the efficiency claim cannot be evaluated.
minor comments (2)
  1. Abstract: grammatical error in “This hierarchical design could reduces computational cost”.
  2. §3: Notation for the patch-size schedule and token counts should be introduced with a single consistent symbol set (e.g., P_l for the large-patch size) rather than repeated prose descriptions; one possible symbol set is sketched below.
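
As an illustration of what such a symbol set might look like (an editorial suggestion, not notation taken from the paper):

```latex
% Editorial suggestion for a consistent notation; not the paper's own symbols.
% P_g and P_loc are the large (global) and small (local) patch sizes,
% P_l is the patch size used at block l, and N_l the resulting token count.
\begin{align}
  P_\ell &\in \{P_{\mathrm{g}},\, P_{\mathrm{loc}}\}, \qquad \ell = 1, \dots, L, \\
  N_\ell &= \frac{H}{P_\ell} \cdot \frac{W}{P_\ell}, \\
  \mathrm{FLOPs} &\approx \sum_{\ell=1}^{L} \bigl( c_1\, N_\ell\, d^{2} + c_2\, N_\ell^{2}\, d \bigr),
\end{align}
```

where d is the hidden width and c_1, c_2 absorb the projection/MLP and attention constants.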

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the thorough review and constructive feedback on our manuscript. We appreciate the recognition of the potential practical advance offered by the hierarchical patch-size schedule and the value of the public code release. We address each major comment below and will revise the manuscript to incorporate the requested details and results.

read point-by-point responses
  1. Referee: §3.2 (Multi-Patch Blocks): the transition operator between large-patch (coarse-token) and small-patch (dense-token) blocks is not specified. No equation or diagram describes the required token reshaping, interpolation, or projection, nor is its parameter or FLOP overhead included in the reported GFLOPs figures. This mechanism is load-bearing for the central 50% reduction claim.

    Authors: We agree that the transition mechanism requires a more precise specification. In the revised manuscript we will add explicit equations describing the token reshaping (via a learned linear projection to align feature dimensions) and any necessary spatial interpolation, together with a diagram of the block transition. We will also report the parameter count and FLOP overhead of the transition operator separately so that the overall GFLOPs reduction claim can be verified. (One plausible form of such a transition is sketched after these point-by-point responses.) Revision: yes.

  2. Referee: §4 (Experiments): the abstract asserts “good generative performance” and a 50% GFLOPs saving, yet the provided text supplies no quantitative FID, IS, or precision-recall numbers, no matched-budget DiT baselines, and no ablation isolating the effect of the patch-size schedule from the embedding changes. Without these controls the efficiency claim cannot be evaluated.

    Authors: We acknowledge that the current draft lacks the quantitative metrics, matched-budget baselines, and isolating ablations needed for rigorous evaluation. In the revision we will expand Section 4 with FID, IS, and precision-recall scores on ImageNet, direct comparisons against DiT models trained under identical compute budgets and parameter counts, and ablation tables that separately measure the contribution of the multi-patch schedule versus the improved embeddings. Revision: yes.
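
The transition promised in the first response (a learned linear projection plus spatial re-tiling) could take several forms. Below is a minimal editorial sketch of one plausible coarse-to-fine token transition, in which each large-patch token is expanded into an r × r group of small-patch tokens by a learned linear map; the class name, the choice r = 2, and the trailing LayerNorm are assumptions, not the released MPDiT code.

```python
# Sketch of one plausible coarse-to-fine token transition: each large-patch
# token expands into an r x r group of small-patch tokens via a learned
# linear map, then tokens are re-flattened in raster order. Editorial
# illustration only, not the authors' implementation.
import torch
import torch.nn as nn


class TokenUpsample(nn.Module):
    def __init__(self, dim, ratio=2):
        super().__init__()
        self.ratio = ratio
        self.expand = nn.Linear(dim, ratio * ratio * dim)   # parameter cost: r^2 * d^2 + r^2 * d
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, grid_hw):
        b, n, d = x.shape
        gh, gw = grid_hw                                     # coarse grid, with n == gh * gw
        r = self.ratio
        x = self.expand(x).reshape(b, gh, gw, r, r, d)       # split each token into an r x r block
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, gh * r * gw * r, d)
        return self.norm(x)                                  # (B, r^2 * N, dim) fine-grained tokens


up = TokenUpsample(dim=384, ratio=2)
fine = up(torch.randn(1, 64, 384), grid_hw=(8, 8))
print(fine.shape)                                            # torch.Size([1, 256, 384])
# FLOP overhead of the transition itself is roughly 2 * N * d * (r^2 * d)
# multiply-adds, the quantity the referee asks to see in the reported GFLOPs.
```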

Circularity Check

0 steps flagged

No circularity: empirical claims rest on ImageNet experiments, not self-referential derivations

full rationale

The paper introduces a multi-patch hierarchical DiT variant with early large-patch blocks for global context and later small-patch blocks for local refinement, claiming up to 50% GFLOPs reduction. All performance assertions are framed as outcomes of direct experiments on ImageNet rather than predictions derived from fitted parameters or self-citations. No equations, ansatzes, or uniqueness theorems are presented that reduce by construction to the inputs; the transition mechanism between patch sizes is described at the architectural level without invoking prior self-work as load-bearing justification. The design is evaluated against external benchmarks rather than its own outputs, so the absence of circularity is an unremarkable, expected finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The design rests on the domain assumption that variable patch sizes can be processed by standard transformer blocks without breaking attention or positional encoding mechanics; no free parameters or invented entities are quantified in the abstract.

axioms (1)
  • domain assumption: Transformer attention layers remain functional when the input token count and spatial resolution change across blocks
    Implicit in the multi-patch schedule described in the abstract.
invented entities (1)
  • Multi-patch global-to-local transformer blocks (no independent evidence)
    purpose: To reduce overall GFLOPs while preserving generative quality
    New design element introduced by the paper

pith-pipeline@v0.9.0 · 5446 in / 1159 out tokens · 34657 ms · 2026-05-14T23:43:05.354351+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 16 internal anchors
