Rethinking Cross-Layer Information Routing in Diffusion Transformers

Chao Xu; Cuifeng Shen; Hanlin Tang; Kan Liu; Lin Qu; Maohua Li; Qirui Li; Shao-Qun Zhang; Tao Lan; Yanke Zhou

arxiv: 2605.20708 · v1 · pith:CQB2RFTPnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang

Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3

keywords Diffusion Transformers · Residual Connections · Cross-Layer Routing · Diffusion-Adaptive Routing · Image Generation · Training Efficiency · SiT · REPA

The pith

Traditional residual addition in Diffusion Transformers creates information flow problems that learnable timestep-adaptive aggregation can fix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion Transformers inherit residual streams from standard Transformers, but a systematic analysis along depth and denoising timestep reveals three concrete symptoms: monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. The paper proposes Diffusion-Adaptive Routing as a drop-in replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. This change improves generation quality on ImageNet and accelerates convergence, while remaining compatible with existing enhancements such as REPA. A sympathetic reader would care because the work identifies cross-layer information routing as an underexplored axis that operates orthogonally to representation-alignment objectives.

Core claim

The paper claims that conventional residual addition in DiTs produces monotonic forward magnitude inflation, sharp backward gradient decay, and block-wise redundancy. Replacing it with Diffusion-Adaptive Routing, a learnable, timestep-adaptive, non-incremental aggregation over historical sublayer outputs, is reported to alleviate these symptoms: FID improves from 9.67 to 7.56 on ImageNet 256×256 with SiT-XL/2, and the model matches the baseline's converged quality with 8.75× fewer training iterations.

What carries the argument

Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs.
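As a rough illustration of what "learnable, timestep-adaptive, non-incremental aggregation" could look like, the sketch below contrasts a running-sum residual stream with a weighted mix over the full history of sublayer outputs. The exact DAR formulation is not given on this page, so everything here is an assumption: the softmax weighting, the linear dependence of the logits on the timestep, and the names `routed_input`, `base_logits`, and `time_slopes` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def residual_stream(sublayer_outputs):
    """Standard residual addition: the state is a running sum of all outputs."""
    state = [0.0] * len(sublayer_outputs[0])
    for out in sublayer_outputs:
        state = [s + o for s, o in zip(state, out)]
    return state

def routed_input(history, t, base_logits, time_slopes):
    """Hypothetical routing step (assumed form, not the paper's equations).

    history      -- earlier sublayer outputs, each a list of floats
    t            -- denoising timestep, assumed to lie in [0, 1]
    base_logits  -- one learnable logit per history entry
    time_slopes  -- one learnable timestep slope per history entry
    """
    logits = [b + s * t for b, s in zip(base_logits, time_slopes)]
    weights = softmax(logits)
    dim = len(history[0])
    mixed = [sum(w * h[i] for w, h in zip(weights, history)) for i in range(dim)]
    return mixed, weights

# Toy history of three sublayer outputs (illustrative values only).
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base_logits = [0.0, 0.0, 0.0]
time_slopes = [-2.0, 0.0, 2.0]

early, w_early = routed_input(history, 0.0, base_logits, time_slopes)
late, w_late = routed_input(history, 1.0, base_logits, time_slopes)
summed = residual_stream(history)
```

In this toy setup the weights are uniform at `t = 0` and shift toward the most recent output at `t = 1`, while the running sum ignores the timestep entirely. Because the assumed weights form a convex combination, the mixed vector cannot grow with depth the way the running sum does; whether the actual method constrains its weights this way is not stated here.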

Load-bearing premise

The three symptoms of monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy are primarily caused by traditional residual addition and are directly alleviated by replacing it with learnable timestep-adaptive non-incremental aggregation.

What would settle it

Training identical DiT models with standard residuals and with DAR, then plotting forward activation magnitudes, per-layer gradient norms, and pairwise block output similarities across timesteps, would show whether the symptoms disappear when the performance gains appear.
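Two of the three diagnostics named above are simple to compute once per-block activations have been recorded. The sketch below is a minimal version under stated assumptions: the helper names are hypothetical, the block outputs are made-up placeholders rather than measurements, and per-layer gradient norms are omitted because they require an autodiff framework.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))

def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = l2_norm(a) * l2_norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom > 0 else 0.0

def forward_magnitudes(states):
    """L2 norm of the hidden state after each block."""
    return [l2_norm(s) for s in states]

def is_monotonic_increasing(values):
    """True if every value is strictly larger than the one before it."""
    return all(b > a for a, b in zip(values, values[1:]))

def adjacent_block_similarity(block_outputs):
    """Cosine similarity between the outputs of consecutive blocks."""
    return [cosine_similarity(a, b)
            for a, b in zip(block_outputs, block_outputs[1:])]

# Placeholder block outputs, invented for illustration only.
block_outputs = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2]]

# Hidden states under plain residual addition (running sum of block outputs).
states = []
running = [0.0, 0.0]
for out in block_outputs:
    running = [r + o for r, o in zip(running, out)]
    states.append(list(running))

magnitudes = forward_magnitudes(states)
similarities = adjacent_block_similarity(block_outputs)
```

For these placeholder inputs the magnitudes rise at every block and adjacent outputs are nearly collinear, which is the qualitative pattern the diagnosis describes. Running the same functions on activations from both model variants, at several timesteps, would give the comparison proposed above.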

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper performs a joint analysis of cross-layer information flow in Diffusion Transformers along depth and denoising timestep, diagnosing three symptoms of standard residual addition (monotonic forward magnitude inflation, sharp backward gradient decay, and block-wise redundancy). It introduces Diffusion-Adaptive Routing (DAR) as a learnable, timestep-adaptive, non-incremental replacement for residual connections. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67), matches baseline quality with 8.75× fewer iterations, and yields 2× early-stage acceleration when stacked with REPA; it is also shown to preserve details in T2I fine-tuning and distillation.

Significance. If the central claims hold, the work identifies cross-layer routing as a design axis orthogonal to representation-alignment objectives such as REPA, with concrete quantitative gains (a 2.11 FID improvement and an 8.75× reduction in training iterations) on a standard benchmark. The drop-in compatibility and applicability to both pretraining and fine-tuning would make the contribution practically relevant for scaling diffusion models.
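The headline figures can be cross-checked with simple arithmetic. The snippet below uses only the numbers reported in the abstract; the derived relative gain and iteration fraction are computed here and are not quoted from the paper.

```python
baseline_fid = 9.67  # SiT-XL/2 baseline, as reported
dar_fid = 7.56       # SiT-XL/2 with DAR, as reported
speedup = 8.75       # reported reduction factor in training iterations

# Absolute FID improvement (lower FID is better).
fid_delta = round(baseline_fid - dar_fid, 2)

# The same improvement as a percentage of the baseline FID.
relative_gain_pct = round(100 * fid_delta / baseline_fid, 1)

# Share of the baseline's iterations needed to match its converged quality.
iteration_fraction_pct = round(100 / speedup, 1)

print(fid_delta, relative_gain_pct, iteration_fraction_pct)
```

The difference reproduces the stated 2.11 FID, which corresponds to roughly a 22% relative reduction, and an 8.75× speedup means reaching baseline quality in about 11% of the baseline's iterations.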

major comments (3)

[§5] §5 (Experimental Results): The manuscript reports FID and iteration-count improvements but provides no post-training measurements of the three diagnosed symptoms (forward magnitude, backward gradient norms, or block-wise redundancy) under DAR versus the SiT-XL/2 baseline, nor any correlation or ablation linking the degree of symptom reduction to the observed 2.11 FID gain. This leaves the causal mechanism unverified.
[§4] §4 (Method): The claim that DAR performs 'non-incremental' aggregation is load-bearing for the diagnosis, yet the precise formulation of the routing parameters and how they enforce non-incrementality versus a simple learned residual is not accompanied by an explicit comparison of the resulting forward-pass magnitude trajectories.
[§3] §3 (Diagnosis): The three symptoms are presented as primarily caused by traditional residual addition, but the analysis does not include controlled interventions (e.g., scaling the residual coefficient or using alternative aggregations) to isolate residual addition from other DiT architectural factors such as timestep conditioning or attention patterns.

minor comments (2)

The abstract and experimental claims would benefit from a brief statement of the number of random seeds and variance of the reported FID numbers.
Notation for the DAR routing parameters could be introduced earlier and used consistently when describing compatibility with REPA.
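The first minor comment asks for seed counts and variance. A minimal sketch of such a report is shown below; the function name `summarize_fid` is hypothetical and the per-seed values are invented placeholders, not results from the paper.

```python
import statistics

def summarize_fid(fid_by_seed):
    """Return the seed count, mean, and sample standard deviation of FID."""
    values = list(fid_by_seed.values())
    mean = statistics.mean(values)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"n_seeds": len(values), "mean": mean, "std": std}

# Placeholder FID values keyed by seed, invented for illustration only.
example_runs = {0: 10.0, 1: 10.5, 2: 11.0}
summary = summarize_fid(example_runs)
print(f"FID {summary['mean']:.2f} ± {summary['std']:.2f} "
      f"over {summary['n_seeds']} seeds")
```

Reporting the mean with a sample standard deviation makes it possible to judge whether a gap such as 2.11 FID is large relative to run-to-run noise.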

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important opportunities to strengthen the empirical validation of our claims. We will revise the manuscript to include post-training measurements of the diagnosed symptoms and explicit forward-pass comparisons. For the isolation of residual addition effects, we will expand the discussion while noting computational constraints on additional experiments.

Referee: [§5] §5 (Experimental Results): The manuscript reports FID and iteration-count improvements but provides no post-training measurements of the three diagnosed symptoms (forward magnitude, backward gradient norms, or block-wise redundancy) under DAR versus the SiT-XL/2 baseline, nor any correlation or ablation linking the degree of symptom reduction to the observed 2.11 FID gain. This leaves the causal mechanism unverified.

Authors: We agree that direct post-training verification would strengthen the causal link. In the revised manuscript we will add measurements of forward magnitude inflation, backward gradient norms, and block-wise redundancy for both DAR and the SiT-XL/2 baseline after convergence. We will also include an ablation that varies the routing parameters and reports the correlation between symptom reduction and FID improvement. These results will appear in a new subsection of §5 and the appendix.

Revision: yes

Referee: [§4] §4 (Method): The claim that DAR performs 'non-incremental' aggregation is load-bearing for the diagnosis, yet the precise formulation of the routing parameters and how they enforce non-incrementality versus a simple learned residual is not accompanied by an explicit comparison of the resulting forward-pass magnitude trajectories.

Authors: We will expand §4 with the exact equations for the timestep-adaptive routing weights and the aggregation operator. We will also add a new figure that plots layer-wise forward-pass magnitude trajectories for (i) standard residual addition, (ii) a learned scalar residual, and (iii) DAR. The comparison will illustrate that DAR prevents monotonic magnitude growth by performing non-incremental, history-dependent combination of sublayer outputs.

Revision: yes

Referee: [§3] §3 (Diagnosis): The three symptoms are presented as primarily caused by traditional residual addition, but the analysis does not include controlled interventions (e.g., scaling the residual coefficient or using alternative aggregations) to isolate residual addition from other DiT architectural factors such as timestep conditioning or attention patterns.

Authors: Our diagnosis rests on consistent empirical patterns observed across multiple DiT scales and training regimes. While controlled interventions such as residual scaling or alternative aggregations would provide stronger isolation, they require training additional large models from scratch. We will add a dedicated paragraph in §3 that discusses potential confounding factors (timestep conditioning, attention) and explicitly states the limitations of the current analysis. We believe this addresses the concern without new full-scale experiments.

Revision: partial
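The three-way magnitude comparison discussed in the responses above can be previewed on a toy model without any training. The sketch below is only that: the sublayer outputs are synthetic unit vectors, the scalar `alpha` stands in for a learned residual coefficient, and a uniform average over the history stands in for learned routing weights. None of these choices come from the paper.

```python
import math

def norm(vec):
    """Euclidean norm of a 2-D vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def unit_output(layer):
    """Stand-in sublayer output: a unit vector whose angle drifts with depth."""
    angle = 0.1 * layer
    return [math.cos(angle), math.sin(angle)]

def trajectories(depth, alpha=0.5):
    """Per-layer state norms for three aggregation rules on the same outputs."""
    outputs = [unit_output(layer) for layer in range(depth)]
    standard, scaled, routed = [], [], []
    sum_state = [0.0, 0.0]
    scaled_state = [0.0, 0.0]
    for layer, out in enumerate(outputs):
        # (i) standard residual addition: plain running sum.
        sum_state = [s + o for s, o in zip(sum_state, out)]
        # (ii) scalar-scaled residual: still incremental, only rescaled.
        scaled_state = [s + alpha * o for s, o in zip(scaled_state, out)]
        # (iii) non-incremental mix: uniform average over the history,
        # used here as a placeholder for learned routing weights.
        hist = outputs[: layer + 1]
        mixed = [sum(h[i] for h in hist) / len(hist) for i in range(2)]
        standard.append(norm(sum_state))
        scaled.append(norm(scaled_state))
        routed.append(norm(mixed))
    return standard, scaled, routed

standard, scaled, routed = trajectories(depth=8)
for layer, row in enumerate(zip(standard, scaled, routed)):
    print(layer, [round(v, 3) for v in row])
```

In this toy setting the running sum grows at every layer, the scaled variant grows at a proportionally smaller rate, and the averaged variant stays at or below unit norm. This only illustrates why an incremental update tends to inflate magnitude; it says nothing about how the trained models actually behave.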

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent parameters

The paper's chain consists of empirical observation of three symptoms in standard residual streams, followed by introduction of a new learnable DAR module with timestep-adaptive non-incremental aggregation, and then experimental validation on ImageNet showing FID gains and faster convergence. No equations reduce a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported via self-citation, and no ansatz is smuggled in. The new routing parameters are additional degrees of freedom rather than reparameterizations of quantities already present in the baseline, and all reported improvements are measured against external benchmarks (SiT-XL/2, REPA) outside the paper's own fitted values. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim rests on empirical effectiveness of a new learnable module; the main unverified premise is that the diagnosed symptoms are fixable by non-incremental aggregation without side effects.

free parameters (1)

DAR routing parameters
Learnable weights that control timestep-adaptive aggregation of sublayer outputs.

axioms (1)

domain assumption: Traditional residual addition produces the three listed symptoms in DiTs
This premise motivates the replacement and is derived from the paper's empirical diagnosis.

invented entities (1)

Diffusion-Adaptive Routing (DAR) · no independent evidence
purpose: Perform learnable, timestep-adaptive, non-incremental aggregation over sublayer history
New architectural component introduced to replace residual addition.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 22 internal anchors

[1]

Rezero is all you need: Fast convergence at large depth

Thomas Bachlechner, Bodhisattwa Prasad Majumder, Henry Mao, Gary Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth. InUncertainty in artificial intelligence, pages 1352–1361. PMLR, 2021

work page 2021
[2]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023

work page 2023
[3]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Towards stabilized and efficient diffusion transformers through long-skip-connections with spectral constraints

Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Xiaoye Qu, Tianlong Chen, and Yu Cheng. Towards stabilized and efficient diffusion transformers through long-skip-connections with spectral constraints. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17708–17718, 2025

work page 2025
[5]

Sortblock: Similarity-aware feature reuse for diffusion model

Hanqi Chen, Xu Zhang, Xiaoliu Guan, Lielin Jiang, Guanzhong Wang, Zeyu Chen, and Yi Liu. Sortblock: Similarity-aware feature reuse for diffusion model. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2859–2867, 2026

work page 2026
[6]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024

work page 2024
[8]

Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, and Zhenguo Li. Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

work page arXiv 2024
[9]

Deep compression autoencoder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733, 2024

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733, 2024

work page arXiv 2024
[10]

Playing with transformer at 30+ fps via next-frame diffusion.arXiv preprint arXiv:2506.01380, 2025

Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, and Jiang Bian. Playing with transformer at 30+ fps via next-frame diffusion.arXiv preprint arXiv:2506.01380, 2025

work page arXiv 2025
[11]

Describe, don’t dictate: Semantic image editing with natural language intent

En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, and Ying Tai. Describe, don’t dictate: Semantic image editing with natural language intent. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19185–19194, 2025

work page 2025
[12]

Causal diffusion transformers for generative modeling.arXiv preprint arXiv:2412.12095, 2024

Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, and Haoqi Fan. Causal diffusion transformers for generative modeling.arXiv preprint arXiv:2412.12095, 2024

work page arXiv 2024
[13]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024. 10

work page 2024
[14]

Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers.arXiv preprint arXiv:2405.05945, 2024

Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers.arXiv preprint arXiv:2405.05945, 2024

work page arXiv 2024
[15]

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[17]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017
[18]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[20]

Densely connected convolutional networks

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017

work page 2017
[21]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Tracing representation progression: Analyzing and enhancing layer-wise similarity.arXiv preprint arXiv:2406.14479, 2024

Jiachen Jiang, Jinxin Zhou, and Zhihui Zhu. Tracing representation progression: Analyzing and enhancing layer-wise similarity.arXiv preprint arXiv:2406.14479, 2024

work page arXiv 2024
[23]

Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

work page 2022
[24]

Variational diffusion models.Advances in neural information processing systems, 34:21696–21707, 2021

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.Advances in neural information processing systems, 34:21696–21707, 2021

work page 2021
[25]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32, 2019

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32, 2019

work page 2019
[27]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

work page 2024
[28]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

work page arXiv 2025
[30]

arXiv preprint arXiv:2602.08064 , year=

Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, and Gao Huang. Siamesenorm: Breaking the barrier to reconciling pre/post-norm.arXiv preprint arXiv:2602.08064, 2026

work page internal anchor Pith review arXiv 2026
[31]

Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

work page 2024
[32]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. 11

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

Playground v3: Improving text-to-image alignment with deep- fusion large language models.ArXiv, abs/2409.10695, 2024

Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, João Pedro Gan- darela de Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep- fusion large language models.ArXiv, abs/2409.10695, 2024. URLhttps://api.semanticscholar.org/CorpusID: 272694430

work page arXiv 2024
[34]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

work page 2024
[36]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. InFindings of the Association for Computational Linguistics: ACL 2025, pages 20192–20204, 2025

work page 2025
[37]

Generating images with sparse representations

Charlie Nash, Jacob Menick, Sander Dieleman, and Peter Battaglia. Generating images with sparse representations. InInternational Conference on Machine Learning, pages 7958–7968. PMLR, 2021

work page 2021
[38]

Transformers without tears: Improving the normalization of self-attention

Toan Q Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. InProceedings of the 16th international conference on spoken language translation, 2019

work page 2019
[39]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. InInternational conference on machine learning, pages 8162–8171. PMLR, 2021

work page 2021
[40]

Denseformer: Enhancing information flow in transformers via depth weighted averaging.Advances in neural information processing systems, 37:136479–136508, 2024

Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, and Martin Jaggi. Denseformer: Enhancing information flow in transformers via depth weighted averaging.Advances in neural information processing systems, 37:136479–136508, 2024

work page 2024
[41]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[42]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[43]

Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

work page 2015
[44]

Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

work page 2016
[45]

A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1):e2408799121, 2025

Antonio Sclocchi, Alessandro Favero, and Matthieu Wyart. A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1):e2408799121, 2025

work page 2025
[46]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

work page 2015
[48]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[49]

Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks

Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, and Jae-Joon Kim. Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks. InInternational Conference on Machine Learning, pages 46136–46155. PMLR, 2024

work page 2024
[50]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011
[51]

Highway Networks

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks.arXiv preprint arXiv:1505.00387, 2015. 12

work page internal anchor Pith review Pith/arXiv arXiv 2015
[52]

Ominicontrol: Minimal and universal control for diffusion transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14940–14950, 2025

work page 2025
[53]

Attention Residuals

Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, et al. Attention residuals.arXiv preprint arXiv:2603.15031, 2026

work page internal anchor Pith review arXiv 2026
[54]

U-dits: Downsample tokens in u-shaped diffusion transformers.Advances in Neural Information Processing Systems, 37:51994–52013, 2024

Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, and Yunhe Wang. U-dits: Downsample tokens in u-shaped diffusion transformers.Advances in Neural Information Processing Systems, 37:51994–52013, 2024

work page 2024
[55]

Going deeper with image transformers

Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 32–42, 2021

work page 2021
[56]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(10):6761–6774, 2024.

[58]

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[59]

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024.

[60]

Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, et al. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880, 2025.

[61]

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International conference on machine learning, pages 10524–10533. PMLR, 2020.

[62]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[63]

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.

[64]

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024.

[65]

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024.

[66]

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations, 2025.

[67]

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025.

[68]

Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19513–19524, 2025.

[69]

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

[70]

Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections. arXiv preprint arXiv:2409.19606, 2024.

In the Appendix, we provide supplementary materials for our work “Rethinking Cross-Layer Information Routing in Diffusion Transformers”, organized according to the corresponding sections in the main...
