Rethinking Cross-Layer Information Routing in Diffusion Transformers
Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3
The pith
Traditional residual addition in Diffusion Transformers creates information flow problems that learnable timestep-adaptive aggregation can fix.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that conventional residual addition in DiTs produces three symptoms: monotonic forward magnitude inflation, sharp backward gradient decay, and block-wise redundancy. It further claims that replacing residual addition with Diffusion-Adaptive Routing (DAR), a learnable, timestep-adaptive, non-incremental aggregation over historical sublayer outputs, directly alleviates these symptoms. On ImageNet 256×256 with SiT-XL/2, FID improves from 9.67 to 7.56, and DAR matches the baseline's converged quality with 8.75× fewer training iterations.
What carries the argument
Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs.
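To make the contrast with residual addition concrete, the following is a minimal sketch, not the paper's formulation. It assumes that "non-incremental" means each state is recomputed as a weighted mix of all earlier sublayer outputs rather than added onto the previous state, and that "timestep-adaptive" means the mixing logits shift with the timestep `t`. The names `routed_states`, `base_logits`, and `t_slopes` are hypothetical, and the sublayer outputs are fixed placeholder vectors rather than real activations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual_states(outputs):
    """Standard residual stream: each state is the running sum of all outputs so far."""
    state = [0.0] * len(outputs[0])
    states = []
    for out in outputs:
        state = [s + o for s, o in zip(state, out)]
        states.append(state)
    return states


def routed_states(outputs, base_logits, t_slopes, t):
    """Hypothetical routing (an assumption, not the paper's equations):
    state l is a convex mix of outputs 0..l, with logits that depend on timestep t."""
    dim = len(outputs[0])
    states = []
    for l in range(len(outputs)):
        logits = [base_logits[l][j] + t * t_slopes[l][j] for j in range(l + 1)]
        weights = softmax(logits)
        states.append(
            [sum(w * outputs[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        )
    return states


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


# Toy input: four identical sublayer outputs (placeholder values, not model activations).
outputs = [[1.0, 0.0] for _ in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
residual_norms = [l2_norm(s) for s in residual_states(outputs)]
routed_norms = [l2_norm(s) for s in routed_states(outputs, zeros, zeros, t=0.5)]
```

In this toy setting the running sum grows with depth, while the convex mix stays at the scale of a single sublayer output. Whether the actual DAR operator normalizes its weights this way is not stated in the material above.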
Load-bearing premise
The three symptoms of monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy are primarily caused by traditional residual addition and are directly alleviated by replacing it with learnable timestep-adaptive non-incremental aggregation.
What would settle it
Training identical DiT models with standard residuals and with DAR, then plotting forward activation magnitudes, per-layer gradient norms, and pairwise block-output similarities across timesteps, would show whether the symptoms disappear and whether the performance gains appear.
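Two of the three measurements named above can be sketched without any deep-learning framework. The helpers below are hypothetical and assume that per-layer hidden states and per-block outputs have already been flattened into lists of floats. Per-layer gradient norms are omitted because they require an autodiff framework.

```python
import math


def rms(vec):
    """Root-mean-square magnitude of one flattened hidden state."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def cosine(a, b):
    """Cosine similarity between two flattened vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def magnitude_profile(hidden_states):
    """Forward magnitude per layer; a strictly rising profile indicates inflation."""
    return [rms(h) for h in hidden_states]


def is_strictly_increasing(values):
    return all(later > earlier for earlier, later in zip(values, values[1:]))


def redundancy_matrix(block_outputs):
    """Pairwise cosine similarity between block outputs; values near 1 suggest redundancy."""
    n = len(block_outputs)
    return [
        [cosine(block_outputs[i], block_outputs[j]) for j in range(n)]
        for i in range(n)
    ]
```

Running these per timestep bucket for both the baseline and the DAR model would give the depth-by-timestep view the comparison calls for.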
Original abstract
Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper performs a joint analysis of cross-layer information flow in Diffusion Transformers along depth and denoising timestep, diagnosing three symptoms of standard residual addition (monotonic forward magnitude inflation, sharp backward gradient decay, and block-wise redundancy). It introduces Diffusion-Adaptive Routing (DAR) as a learnable, timestep-adaptive, non-incremental replacement for residual connections. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67), matches baseline quality with 8.75× fewer iterations, and yields 2× early-stage acceleration when stacked with REPA; it is also shown to preserve details in T2I fine-tuning and distillation.
Significance. If the central claims hold, the work identifies cross-layer routing as a design axis orthogonal to representation-alignment objectives such as REPA, with concrete quantitative gains on a standard benchmark (a 2.11 FID improvement and an 8.75× reduction in training iterations). The drop-in compatibility and applicability to both pretraining and fine-tuning stages would make the contribution practically relevant for scaling diffusion models.
major comments (3)
- §5 (Experimental Results): The manuscript reports FID and iteration-count improvements but provides no post-training measurements of the three diagnosed symptoms (forward magnitude, backward gradient norms, or block-wise redundancy) under DAR versus the SiT-XL/2 baseline, nor any correlation or ablation linking the degree of symptom reduction to the observed 2.11 FID gain. This leaves the causal mechanism unverified.
- §4 (Method): The claim that DAR performs 'non-incremental' aggregation is load-bearing for the diagnosis, yet the precise formulation of the routing parameters and how they enforce non-incrementality versus a simple learned residual is not accompanied by an explicit comparison of the resulting forward-pass magnitude trajectories.
- §3 (Diagnosis): The three symptoms are presented as primarily caused by traditional residual addition, but the analysis does not include controlled interventions (e.g., scaling the residual coefficient or using alternative aggregations) to isolate residual addition from other DiT architectural factors such as timestep conditioning or attention patterns.
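As an illustration of the kind of controlled intervention requested in the §3 comment, the sketch below sweeps a residual coefficient in a scalar toy recurrence, h_{l+1} = alpha * h_l + f, with a constant sublayer output f. This is an assumption-laden toy, not a DiT experiment: the function names are hypothetical, and real sublayer outputs depend on the incoming state.

```python
def scaled_residual_norms(alpha, sublayer_out, depth):
    """Magnitude trajectory of the toy recurrence h <- alpha * h + sublayer_out."""
    h = 0.0
    norms = []
    for _ in range(depth):
        h = alpha * h + sublayer_out
        norms.append(abs(h))
    return norms


def growth_ratio(norms):
    """Last-layer magnitude divided by first-layer magnitude."""
    return norms[-1] / norms[0]


def sweep(alphas, sublayer_out=1.0, depth=8):
    """Growth ratio for each residual coefficient (toy values, not measurements)."""
    return {a: growth_ratio(scaled_residual_norms(a, sublayer_out, depth)) for a in alphas}
```

With alpha = 1.0 the magnitude grows linearly with depth, while alpha < 1.0 bounds it. A real intervention would hold timestep conditioning and attention fixed while varying only this coefficient, which is what would separate the residual path from the other factors the comment lists.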
minor comments (2)
- The abstract and experimental claims would benefit from a brief statement of the number of random seeds and variance of the reported FID numbers.
- Notation for the DAR routing parameters could be introduced earlier and used consistently when describing compatibility with REPA.
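The seed-variance statement requested in the first minor comment could be produced with a helper like the one below. It is a sketch using Python's standard `statistics` module; the function names are hypothetical, and any numbers passed in would come from the authors' own runs, none of which are reported here.

```python
import statistics


def summarize_runs(fid_per_seed):
    """Mean and sample standard deviation of FID across independent seeds."""
    if len(fid_per_seed) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    return {
        "n_seeds": len(fid_per_seed),
        "mean": statistics.mean(fid_per_seed),
        "std": statistics.stdev(fid_per_seed),  # sample std, divides by n - 1
    }


def format_summary(summary):
    """Render a summary in the common 'mean ± std (n=...)' reporting style."""
    return f"{summary['mean']:.2f} ± {summary['std']:.2f} (n={summary['n_seeds']})"
```

Reporting the baseline and DAR in this form would show whether the 2.11 FID gap is large relative to seed-to-seed variation.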
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important opportunities to strengthen the empirical validation of our claims. We will revise the manuscript to include post-training measurements of the diagnosed symptoms and explicit forward-pass comparisons. For the isolation of residual addition effects, we will expand the discussion while noting computational constraints on additional experiments.
Point-by-point responses
- Referee: §5 (Experimental Results): The manuscript reports FID and iteration-count improvements but provides no post-training measurements of the three diagnosed symptoms (forward magnitude, backward gradient norms, or block-wise redundancy) under DAR versus the SiT-XL/2 baseline, nor any correlation or ablation linking the degree of symptom reduction to the observed 2.11 FID gain. This leaves the causal mechanism unverified.
Authors: We agree that direct post-training verification would strengthen the causal link. In the revised manuscript we will add measurements of forward magnitude inflation, backward gradient norms, and block-wise redundancy for both DAR and the SiT-XL/2 baseline after convergence. We will also include an ablation that varies the routing parameters and reports the correlation between symptom reduction and FID improvement. These results will appear in a new subsection of §5 and the appendix.
Revision: yes
- Referee: §4 (Method): The claim that DAR performs 'non-incremental' aggregation is load-bearing for the diagnosis, yet the precise formulation of the routing parameters and how they enforce non-incrementality versus a simple learned residual is not accompanied by an explicit comparison of the resulting forward-pass magnitude trajectories.
Authors: We will expand §4 with the exact equations for the timestep-adaptive routing weights and the aggregation operator. We will also add a new figure that plots layer-wise forward-pass magnitude trajectories for (i) standard residual addition, (ii) a learned scalar residual, and (iii) DAR. The comparison will illustrate that DAR prevents monotonic magnitude growth by performing non-incremental, history-dependent combination of sublayer outputs.
Revision: yes
- Referee: §3 (Diagnosis): The three symptoms are presented as primarily caused by traditional residual addition, but the analysis does not include controlled interventions (e.g., scaling the residual coefficient or using alternative aggregations) to isolate residual addition from other DiT architectural factors such as timestep conditioning or attention patterns.
Authors: Our diagnosis rests on consistent empirical patterns observed across multiple DiT scales and training regimes. While controlled interventions such as residual scaling or alternative aggregations would provide stronger isolation, they require training additional large models from scratch. We will add a dedicated paragraph in §3 that discusses potential confounding factors (timestep conditioning, attention) and explicitly states the limitations of the current analysis. We believe this addresses the concern without new full-scale experiments.
Revision: partial
Circularity Check
No significant circularity; empirical method with independent parameters
Full rationale
The paper's chain consists of empirical observation of three symptoms in standard residual streams, followed by introduction of a new learnable DAR module with timestep-adaptive non-incremental aggregation, and then experimental validation on ImageNet showing FID gains and faster convergence. No equations reduce a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported via self-citation, and no ansatz is smuggled in. The new routing parameters are additional degrees of freedom rather than reparameterizations of quantities already present in the baseline, and all reported improvements are measured against external benchmarks (SiT-XL/2, REPA) outside the paper's own fitted values. This is the common case of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- DAR routing parameters
axioms (1)
- Domain assumption: traditional residual addition produces the three listed symptoms in DiTs
invented entities (1)
- Diffusion-Adaptive Routing (DAR): no independent evidence