Recognition: 2 theorem links
MPDiT: Multi-Patch Global-to-Local Transformer Architecture for Efficient Flow Matching and Diffusion Models
Pith reviewed 2026-05-14 23:43 UTC · model grok-4.3
The pith
Multi-patch global-to-local transformers cut computational cost in diffusion models by up to half while preserving generative performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a multi-patch global-to-local transformer architecture, in which early blocks operate on larger patches and later blocks operate on smaller patches, reduces computational cost by up to 50 percent in GFLOPs while achieving good generative performance on ImageNet for both diffusion and flow-matching models.
What carries the argument
The multi-patch global-to-local hierarchy that varies patch size across successive transformer blocks to capture global structure first and local detail later.
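The arithmetic behind this mechanism can be sketched numerically. The block below is a back-of-envelope FLOPs model, not the paper's own accounting: the latent grid size, width, depth, and patch schedule are illustrative assumptions, and the coarse-to-dense transition overhead is omitted.

```python
# Back-of-envelope FLOPs for a DiT-style block as a function of patch size.
# All shapes (32x32 latent grid, width d=1152, 28 blocks) are illustrative
# assumptions for this sketch, not figures taken from the paper.

def block_flops(n_tokens: int, d: int) -> float:
    """Approximate forward FLOPs of one transformer block:
    QKV/output projections ~4*n*d^2, attention matmuls ~2*n^2*d,
    MLP with 4x expansion ~8*n*d^2."""
    return 12 * n_tokens * d**2 + 2 * n_tokens**2 * d

def model_flops(schedule, grid: int = 32, d: int = 1152) -> float:
    """schedule: list of (num_blocks, patch_size) pairs, applied in order."""
    total = 0.0
    for num_blocks, patch in schedule:
        n_tokens = (grid // patch) ** 2   # larger patch -> fewer tokens
        total += num_blocks * block_flops(n_tokens, d)
    return total

iso = model_flops([(28, 2)])            # isotropic DiT: every block at patch 2
mp = model_flops([(14, 4), (14, 2)])    # global-to-local: first half at patch 4
print(f"multi-patch / isotropic GFLOPs ratio: {mp / iso:.2f}")
```

Under these assumptions a half-coarse schedule already lands near a 0.62 cost ratio; spending more blocks at the coarse patch size moves the ratio toward the claimed 50% reduction.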
If this is right
- Diffusion and flow-matching models can be trained with substantially lower floating-point operations while retaining competitive image quality on ImageNet.
- Redesigned time and class embeddings accelerate training convergence beyond the savings from patch hierarchy alone.
- The same global-to-local patch progression applies directly to both diffusion and flow-matching training pipelines.
- Generative performance holds without extra architectural compensations once the patch-size schedule is set.
Where Pith is reading between the lines
- The design may transfer naturally to video or 3D generation where global context precedes local refinement.
- Combining the patch hierarchy with existing efficiency techniques such as pruning could produce additive savings.
- Inference cost may also drop because early blocks already operate on fewer tokens, though the paper focuses on training.
- Scaling the schedule to higher resolutions or larger models would test whether the 50 percent reduction holds proportionally.
Load-bearing premise
Switching patch sizes across blocks preserves the same generative quality and convergence behavior as a standard isotropic DiT without requiring additional compensatory changes.
What would settle it
A controlled comparison in which an isotropic DiT baseline with matched total compute or parameters achieves meaningfully better FID scores or faster convergence than MPDiT on ImageNet would falsify the efficiency claim.
read the original abstract
Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training process. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design could reduces computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we also propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at: https://github.com/quandao10/MPDiT
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MPDiT, a multi-patch global-to-local transformer for diffusion and flow-matching models. Early blocks operate on larger patches to capture coarse global context while later blocks use smaller patches for local refinement; the design is claimed to reduce GFLOPs by up to 50% relative to isotropic DiT while preserving generative performance on ImageNet. Improved time- and class-embedding schemes are also introduced to accelerate convergence. Code is released.
Significance. If the efficiency claims are substantiated under matched training budgets and model sizes, the hierarchical patch-size schedule would constitute a practical advance for scaling transformer-based generative models. The public code release is a clear strength that enables direct verification and extension.
major comments (2)
- [§3.2] §3.2 (Multi-Patch Blocks): the transition operator between large-patch (coarse-token) and small-patch (dense-token) blocks is not specified. No equation or diagram describes the required token reshaping, interpolation, or projection, nor is its parameter or FLOP overhead included in the reported GFLOPs figures. This mechanism is load-bearing for the central 50% reduction claim.
- [§4] §4 (Experiments): the abstract asserts “good generative performance” and a 50% GFLOPs saving, yet the provided text supplies no quantitative FID, IS, or precision-recall numbers, no matched-budget DiT baselines, and no ablation isolating the effect of the patch-size schedule from the embedding changes. Without these controls the efficiency claim cannot be evaluated.
minor comments (2)
- [Abstract] Abstract: grammatical error in “This hierarchical design could reduces computational cost”.
- [§3] Notation for patch-size schedule and token counts should be introduced with a single consistent symbol set (e.g., P_l for large-patch size) rather than repeated prose descriptions.
Simulated Author's Rebuttal
Thank you for the thorough review and constructive feedback on our manuscript. We appreciate the recognition of the potential practical advance offered by the hierarchical patch-size schedule and the value of the public code release. We address each major comment below and will revise the manuscript to incorporate the requested details and results.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Multi-Patch Blocks): the transition operator between large-patch (coarse-token) and small-patch (dense-token) blocks is not specified. No equation or diagram describes the required token reshaping, interpolation, or projection, nor is its parameter or FLOP overhead included in the reported GFLOPs figures. This mechanism is load-bearing for the central 50% reduction claim.
Authors: We agree that the transition mechanism requires a more precise specification. In the revised manuscript we will add explicit equations describing the token reshaping (via a learned linear projection to align feature dimensions) and any necessary spatial interpolation, together with a diagram of the block transition. We will also report the parameter count and FLOP overhead of the transition operator separately so that the overall GFLOPs reduction claim can be verified. revision: yes
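One plausible form of the transition the referee asks for is nearest-neighbor token duplication on the spatial grid followed by a learned linear projection. The sketch below is hypothetical: the paper does not specify its operator, and the grid size, width, and random "learned" weights here are stand-in assumptions.

```python
import numpy as np

# Hypothetical coarse-to-dense token transition. This is one plausible
# construction, not the authors' design: coarse tokens on a (g, g) grid are
# expanded to a (2g, 2g) grid by nearest-neighbor duplication, then passed
# through a learned linear projection (random weights stand in for it here).

rng = np.random.default_rng(0)
g, d = 8, 64                                     # coarse grid side, embed dim
coarse = rng.standard_normal((g * g, d))         # (64, d) coarse tokens
W = rng.standard_normal((d, d)) / np.sqrt(d)     # stand-in learned projection

grid = coarse.reshape(g, g, d)
dense_grid = grid.repeat(2, axis=0).repeat(2, axis=1)   # (2g, 2g, d)
dense = dense_grid.reshape(4 * g * g, d) @ W            # (256, d) dense tokens

print(coarse.shape, "->", dense.shape)
# Overhead of this transition: d*d projection parameters and ~2*N*d*d FLOPs,
# which would need to be reported alongside block FLOPs for the GFLOPs claim.
```

Any such operator adds parameters and FLOPs that belong in the reported totals, which is the point of the referee's request.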
-
Referee: [§4] §4 (Experiments): the abstract asserts “good generative performance” and a 50% GFLOPs saving, yet the provided text supplies no quantitative FID, IS, or precision-recall numbers, no matched-budget DiT baselines, and no ablation isolating the effect of the patch-size schedule from the embedding changes. Without these controls the efficiency claim cannot be evaluated.
Authors: We acknowledge that the current draft lacks the quantitative metrics, matched-budget baselines, and isolating ablations needed for rigorous evaluation. In the revision we will expand Section 4 with FID, IS, and precision-recall scores on ImageNet, direct comparisons against DiT models trained under identical compute budgets and parameter counts, and ablation tables that separately measure the contribution of the multi-patch schedule versus the improved embeddings. revision: yes
Circularity Check
No circularity: empirical claims rest on ImageNet experiments, not self-referential derivations
full rationale
The paper introduces a multi-patch hierarchical DiT variant with early large-patch blocks for global context and later small-patch blocks for local refinement, claiming up to 50% GFLOPs reduction. All performance assertions are framed as outcomes of direct experiments on ImageNet rather than predictions derived from fitted parameters or self-citations. No equations, ansatzes, or uniqueness theorems are presented that reduce by construction to the inputs; the transition mechanism between patch sizes is described at the architectural level without invoking prior self-work as load-bearing justification. The design is self-contained against external benchmarks, yielding a normal non-finding of circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Transformer attention layers remain functional when input token count and spatial resolution change across blocks.
invented entities (1)
- Multi-patch global-to-local transformer blocks (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
early blocks operate on larger patches ... later blocks use smaller patches ... upsample module expands ... 50% reduction in GFLOPs
-
IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
FNO time embedding ... multi-token class embedding
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.