
arxiv: 2605.20708 · v1 · pith:CQB2RFTPnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

Rethinking Cross-Layer Information Routing in Diffusion Transformers

Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Diffusion Transformers · Residual Connections · Cross-Layer Routing · Diffusion-Adaptive Routing · Image Generation · Training Efficiency · SiT · REPA

The pith

Traditional residual addition in Diffusion Transformers creates information flow problems that learnable timestep-adaptive aggregation can fix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion Transformers inherit residual streams from standard Transformers, but a systematic analysis along depth and denoising timestep reveals three concrete symptoms: monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. The paper proposes Diffusion-Adaptive Routing as a drop-in replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. This change improves generation quality on ImageNet and accelerates convergence, while remaining compatible with existing enhancements such as REPA. A sympathetic reader would care because the work identifies cross-layer information routing as an underexplored axis that operates orthogonally to representation-alignment objectives.

Core claim

The paper's core claim is that conventional residual addition in DiTs produces monotonic forward magnitude inflation, sharp backward gradient decay, and block-wise redundancy, and that replacing it with Diffusion-Adaptive Routing (a learnable, timestep-adaptive, non-incremental aggregation over historical sublayer outputs) directly alleviates these symptoms. With SiT-XL/2 on ImageNet 256×256, the reported FID improves from 9.67 to 7.56, and the model matches the baseline's converged quality with 8.75× fewer training iterations.

What carries the argument

Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs.
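
The description above can be made concrete with a minimal NumPy sketch. This is an illustration only, not the paper's formulation, which this page does not give. The softmax normalisation, the linear map from timestep embedding to logits, and the names `adaptive_route`, `W`, `b`, and `t_emb` are all assumptions. "Non-incremental" is read here as rebuilding the next sublayer's input from the whole history instead of adding to a running sum.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_route(history, t_emb, W, b):
    """Mix all earlier sublayer outputs with timestep-dependent weights.

    history: list of k arrays of shape (tokens, dim), one per earlier sublayer
    t_emb:   timestep embedding of shape (d_t,)
    W, b:    hypothetical learnable routing parameters, shapes (k, d_t) and (k,)
    """
    logits = W @ t_emb + b             # one logit per history entry
    weights = softmax(logits)          # assumed normalisation: non-negative, sums to 1
    stacked = np.stack(history)        # (k, tokens, dim)
    mixed = np.tensordot(weights, stacked, axes=1)  # (tokens, dim)
    return mixed, weights

rng = np.random.default_rng(0)
history = [rng.standard_normal((4, 8)) for _ in range(3)]
W = rng.standard_normal((3, 5))
b = np.zeros(3)

# The same history is mixed differently at two different timesteps.
mixed_a, w_a = adaptive_route(history, rng.standard_normal(5), W, b)
mixed_b, w_b = adaptive_route(history, rng.standard_normal(5), W, b)
```

With `W` and `b` set to zero the weights become uniform and the result is the plain average of the history. That gives a simple reference point against which learned, timestep-dependent weights can be compared.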

Load-bearing premise

The three symptoms of monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy are primarily caused by traditional residual addition and are directly alleviated by replacing it with learnable timestep-adaptive non-incremental aggregation.

What would settle it

Training identical DiT models with standard residuals versus DAR while plotting forward activation magnitudes, per-layer gradient norms, and pairwise block output similarities across timesteps would confirm whether the symptoms disappear and performance gains appear.
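
Two of the three measurements named above can be sketched with NumPy on a toy residual stack. The random `tanh` layers are a stand-in for real DiT blocks, and the helper names `forward_magnitudes` and `block_similarity` are hypothetical. Per-layer gradient norms are left out because they need an autodiff framework.

```python
import numpy as np

def forward_magnitudes(states):
    """Mean per-token L2 norm of the hidden state after each block."""
    return [float(np.linalg.norm(h, axis=-1).mean()) for h in states]

def block_similarity(outputs):
    """Pairwise cosine similarity between flattened block outputs."""
    flat = np.stack([o.ravel() for o in outputs])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T

# Toy stand-in for a residual stack: h_{l+1} = h_l + f_l(h_l)
rng = np.random.default_rng(0)
tokens, dim, depth = 16, 32, 8
h = rng.standard_normal((tokens, dim))
states, outputs = [h], []
for _ in range(depth):
    W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    out = np.tanh(h @ W)      # block output f_l(h_l)
    outputs.append(out)
    h = h + out               # standard residual addition
    states.append(h)

mags = forward_magnitudes(states)   # one value per depth, input included
sim = block_similarity(outputs)     # (depth, depth) similarity matrix
```

In a real comparison the same two functions would be applied to activations captured from the baseline and from the DAR model at several denoising timesteps, so that the curves can be read side by side.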

Original abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper performs a joint analysis of cross-layer information flow in Diffusion Transformers along depth and denoising timestep, diagnosing three symptoms of standard residual addition (monotonic forward magnitude inflation, sharp backward gradient decay, and block-wise redundancy). It introduces Diffusion-Adaptive Routing (DAR) as a learnable, timestep-adaptive, non-incremental replacement for residual connections. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67), matches baseline quality with 8.75× fewer iterations, and yields 2× early-stage acceleration when stacked with REPA; it is also shown to preserve details in T2I fine-tuning and distillation.

Significance. If the central claims hold, the work identifies cross-layer routing as an orthogonal design axis to representation-alignment objectives such as REPA, with concrete quantitative gains (2.11 FID delta and nearly 9× iteration reduction) on a standard benchmark. The drop-in compatibility and applicability to both pretraining and fine-tuning stages would make the contribution practically relevant for scaling diffusion models.

major comments (3)
  1. §5 (Experimental Results): The manuscript reports FID and iteration-count improvements but provides no post-training measurements of the three diagnosed symptoms (forward magnitude, backward gradient norms, or block-wise redundancy) under DAR versus the SiT-XL/2 baseline, nor any correlation or ablation linking the degree of symptom reduction to the observed 2.11 FID gain. This leaves the causal mechanism unverified.
  2. §4 (Method): The claim that DAR performs 'non-incremental' aggregation is load-bearing for the diagnosis, yet the precise formulation of the routing parameters and how they enforce non-incrementality versus a simple learned residual is not accompanied by an explicit comparison of the resulting forward-pass magnitude trajectories.
  3. §3 (Diagnosis): The three symptoms are presented as primarily caused by traditional residual addition, but the analysis does not include controlled interventions (e.g., scaling the residual coefficient or using alternative aggregations) to isolate residual addition from other DiT architectural factors such as timestep conditioning or attention patterns.
minor comments (2)
  1. The abstract and experimental claims would benefit from a brief statement of the number of random seeds and variance of the reported FID numbers.
  2. Notation for the DAR routing parameters could be introduced earlier and used consistently when describing compatibility with REPA.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important opportunities to strengthen the empirical validation of our claims. We will revise the manuscript to include post-training measurements of the diagnosed symptoms and explicit forward-pass comparisons. For the isolation of residual addition effects, we will expand the discussion while noting computational constraints on additional experiments.

Point-by-point responses
  1. Referee: §5 (Experimental Results): The manuscript reports FID and iteration-count improvements but provides no post-training measurements of the three diagnosed symptoms (forward magnitude, backward gradient norms, or block-wise redundancy) under DAR versus the SiT-XL/2 baseline, nor any correlation or ablation linking the degree of symptom reduction to the observed 2.11 FID gain. This leaves the causal mechanism unverified.

    Authors: We agree that direct post-training verification would strengthen the causal link. In the revised manuscript we will add measurements of forward magnitude inflation, backward gradient norms, and block-wise redundancy for both DAR and the SiT-XL/2 baseline after convergence. We will also include an ablation that varies the routing parameters and reports the correlation between symptom reduction and FID improvement. These results will appear in a new subsection of §5 and the appendix. revision: yes

  2. Referee: §4 (Method): The claim that DAR performs 'non-incremental' aggregation is load-bearing for the diagnosis, yet the precise formulation of the routing parameters and how they enforce non-incrementality versus a simple learned residual is not accompanied by an explicit comparison of the resulting forward-pass magnitude trajectories.

    Authors: We will expand §4 with the exact equations for the timestep-adaptive routing weights and the aggregation operator. We will also add a new figure that plots layer-wise forward-pass magnitude trajectories for (i) standard residual addition, (ii) a learned scalar residual, and (iii) DAR. The comparison will illustrate that DAR prevents monotonic magnitude growth by performing non-incremental, history-dependent combination of sublayer outputs. revision: yes

  3. Referee: §3 (Diagnosis): The three symptoms are presented as primarily caused by traditional residual addition, but the analysis does not include controlled interventions (e.g., scaling the residual coefficient or using alternative aggregations) to isolate residual addition from other DiT architectural factors such as timestep conditioning or attention patterns.

    Authors: Our diagnosis rests on consistent empirical patterns observed across multiple DiT scales and training regimes. While controlled interventions such as residual scaling or alternative aggregations would provide stronger isolation, they require training additional large models from scratch. We will add a dedicated paragraph in §3 that discusses potential confounding factors (timestep conditioning, attention) and explicitly states the limitations of the current analysis. We believe this addresses the concern without new full-scale experiments. revision: partial
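
The three-way magnitude comparison promised in response 2 could look roughly like the toy sketch below. It is not the authors' experiment. The random `tanh` blocks, the fixed coefficient `alpha`, and the uniform average over the history (used in place of DAR's learned weights) are all assumptions, as is the function name `run`.

```python
import numpy as np

def run(mode, depth=12, tokens=16, dim=32, alpha=0.5, seed=0):
    """Return the mean per-token norm of the hidden state after each block."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal((tokens, dim))
    history = [x0]            # input plus every block output so far
    h = x0
    norms = []
    for _ in range(depth):
        W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        out = np.tanh(h @ W)
        history.append(out)
        if mode == "residual":      # (i) standard residual addition
            h = h + out
        elif mode == "scaled":      # (ii) scalar-weighted residual, alpha fixed here
            h = h + alpha * out
        elif mode == "average":     # (iii) stand-in: uniform mix over the history
            h = np.mean(history, axis=0)
        else:
            raise ValueError(f"unknown mode: {mode}")
        norms.append(float(np.linalg.norm(h, axis=-1).mean()))
    return norms

trajectories = {mode: run(mode) for mode in ("residual", "scaled", "average")}
```

The third mode is a convex combination, so its norm can never exceed the largest norm among the inputs being mixed. That is one way a history-based aggregation can avoid accumulating magnitude with depth, although whether the learned DAR weights behave this way is exactly what the referee asks the authors to show.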

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent parameters

Full rationale

The paper's chain consists of empirical observation of three symptoms in standard residual streams, followed by introduction of a new learnable DAR module with timestep-adaptive non-incremental aggregation, and then experimental validation on ImageNet showing FID gains and faster convergence. No equations reduce a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported via self-citation, and no ansatz is smuggled in. The new routing parameters are additional degrees of freedom rather than reparameterizations of quantities already present in the baseline, and all reported improvements are measured against external benchmarks (SiT-XL/2, REPA) outside the paper's own fitted values. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim rests on empirical effectiveness of a new learnable module; the main unverified premise is that the diagnosed symptoms are fixable by non-incremental aggregation without side effects.

free parameters (1)
  • DAR routing parameters
    Learnable weights that control timestep-adaptive aggregation of sublayer outputs.
axioms (1)
  • Domain assumption: Traditional residual addition produces the three listed symptoms in DiTs
    This premise motivates the replacement and is derived from the paper's empirical diagnosis.
invented entities (1)
  • Diffusion-Adaptive Routing (DAR) (no independent evidence)
    purpose: Perform learnable, timestep-adaptive, non-incremental aggregation over sublayer history
    New architectural component introduced to replace residual addition.

pith-pipeline@v0.9.0 · 5873 in / 1426 out tokens · 51756 ms · 2026-05-21T05:26:38.226889+00:00 · methodology

