arxiv: 2605.06169 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.CV

Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

Pith reviewed 2026-05-08 13:32 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords diffusion transformers · deep residual networks · mean mode screaming · mean-variance split residuals · gradient decomposition · stable training · scaling depth · generative models

The pith

Mean-Variance Split residuals stop mean-mode collapse and let Diffusion Transformers train stably at 1000 layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies Mean Mode Screaming (MMS) as the trigger for a silent collapse in deep Diffusion Transformers (DiTs), in which residual branches become mean-dominated and suppress centered token variation even while training metrics look stable. It isolates the trigger through an exact decomposition of the gradients into mean-coherent and centered parts, and shows that the collapse is compounded once values homogenize, because attention-logit gradients are then suppressed through the null space of the softmax Jacobian. To counter it, the authors introduce Mean-Variance Split (MV-Split) residuals, which apply a separately gained centered update together with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, this prevents the crash that hits the baseline while staying close to the baseline's pre-crash trajectory and outperforming LayerScale. The same fix is then shown to keep a 1000-layer model stably trainable.

Core claim

Deep DiTs can enter a silent, mean-dominated collapse state. The trigger event, Mean Mode Screaming (MMS), is a mean-coherent backward shock on the residual writers. The paper isolates it with an exact decomposition of the gradients into mean-coherent and centered components, and shows that it is compounded by structural suppression of attention-logit gradients through the null space of the softmax Jacobian once values homogenize. Mean-Variance Split (MV-Split) residuals counteract the collapse by combining a separately gained centered residual update with a leaky trunk-mean replacement. With them, a 400-layer DiT avoids the divergent failure and a 1000-layer DiT remains stably trainable.
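
To make the null-space part of the claim concrete, here is a minimal NumPy sketch for a single attention row. It is not the paper's code: it assumes the standard softmax Jacobian diag(p) − p pᵀ, and the function name `logit_grad` and the toy shapes are illustrative choices. When every value vector is identical, the upstream gradient on the attention probabilities is constant across keys, which lies in the Jacobian's null space, so the logit gradient vanishes.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def logit_grad(logits, values, grad_out):
    """Gradient of a scalar loss w.r.t. one attention row's logits.

    logits:   (T,)   pre-softmax scores for one query
    values:   (T, d) value vectors
    grad_out: (d,)   upstream gradient on o = softmax(logits) @ values
    """
    p = softmax(logits)
    jac = np.diag(p) - np.outer(p, p)  # softmax Jacobian (symmetric)
    grad_p = values @ grad_out         # dL/dp_j = <v_j, grad_out>
    return jac @ grad_p

rng = np.random.default_rng(0)
T, d = 6, 4
logits = rng.normal(size=T)
grad_out = rng.normal(size=d)

diverse_values = rng.normal(size=(T, d))
# Homogenized case (toy assumption): every token carries the same value vector.
homogenized_values = np.tile(rng.normal(size=d), (T, 1))

g_diverse = logit_grad(logits, diverse_values, grad_out)
g_homog = logit_grad(logits, homogenized_values, grad_out)

print("logit-gradient norm, diverse values:    ", np.linalg.norm(g_diverse))
print("logit-gradient norm, homogenized values:", np.linalg.norm(g_homog))
```

In this toy setting the second norm is zero up to floating-point rounding, regardless of how large `grad_out` is. That is the sense in which homogenized values can silence the Q/K path.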

What carries the argument

Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement to preserve both mean and variance dynamics.
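
The page does not give the exact MV-Split formula, so the following is only one possible reading of the sentence above, written as a NumPy sketch. Everything in it is an assumption made for illustration: the function names, the scalar `centered_gain` and `mean_leak` parameters (the paper may well use learned or per-channel quantities), and the specific way the branch mean is blended into the trunk mean.

```python
import numpy as np

def split_mean_centered(x):
    """Split a (T, d) token matrix into its token mean and centered part."""
    mean = x.mean(axis=0, keepdims=True)  # (1, d), shared by all tokens
    return mean, x - mean

def mv_split_residual(x, branch_out, centered_gain, mean_leak):
    """Hypothetical MV-Split-style update (illustrative, not the paper's code).

    Centered part: ordinary residual addition with its own gain.
    Mean part:     leaky replacement of the trunk mean by the branch mean.
    """
    trunk_mean, trunk_ctr = split_mean_centered(x)
    branch_mean, branch_ctr = split_mean_centered(branch_out)
    new_ctr = trunk_ctr + centered_gain * branch_ctr
    new_mean = (1.0 - mean_leak) * trunk_mean + mean_leak * branch_mean
    return new_mean + new_ctr  # mean broadcasts back over tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
# A branch output with a large offset shared by all tokens (toy mean-dominated case).
branch_out = rng.normal(size=(8, 4)) + 100.0

out = mv_split_residual(x, branch_out, centered_gain=0.5, mean_leak=0.1)
plain = x + branch_out  # standard residual update, for comparison

print("max |token mean|, plain residual:", np.abs(plain.mean(axis=0)).max())
print("max |token mean|, split residual:", np.abs(out.mean(axis=0)).max())
```

Under these assumptions the shared offset enters the trunk only through the leak factor, while the centered (token-varying) signal is added with an independent gain. That separation is what distinguishes this kind of update from a token-isotropic gate that scales both parts together.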

If this is right

  • A 400-layer single-stream DiT avoids the divergent collapse that crashes the baseline.
  • The stabilized model tracks the baseline trajectory up to the crash point while outperforming token-isotropic methods like LayerScale over the full schedule.
  • A 1000-layer DiT remains stably trainable, confirming the architecture works at boundary scales.
  • MMS can be detected and blocked even when training initially appears stable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same split-residual pattern could be tested in other residual-heavy generative architectures to see whether mean-mode collapse is a general depth limit.
  • Scaling studies for diffusion models could now treat extreme depth as a controllable variable once MMS is mitigated.
  • If MV-Split preserves variance flow, it may also improve sample diversity in very deep models compared with mean-only stabilizations.

Load-bearing premise

The stability seen in the 1000-layer run is produced by the MV-Split residuals rather than other unstated training choices or baseline stabilizations.

What would settle it

Train the identical 1000-layer DiT without MV-Split residuals and check whether it enters the same mean-dominated collapse and crashes as the unstabilized baseline.

Figures

Figures reproduced from arXiv: 2605.06169 by Pengqi Lu.

Figure 1. Text-to-image generation samples from our 1000-layer MV-Split DiT. More samples are provided in Appendix M. Code: https://github.com/erwold/mv-split. Model weights: https://huggingface.co/StableKirito/mvsplit-dit-1000l.
Figure 2. Baseline DiT and representative training diagnostics. (Left) Single-stream DiT backbone. (Middle) Training loss over the first 10k steps for the un-stabilized 400-layer baseline and the MV-Split 400-/1000-layer runs. (Right) Per-layer energy ratio ρ_T = ∥µ(X)∥_F / ∥c(X)∥_F (Appendix A) across L0–L384 in a baseline run.
Figure 3. Empirical trajectory of a representative divergence event (400-layer). The vertical dashed line marks the divergence step. (a–c) Backward trigger: The global gradient norm spikes (a). The spike is concentrated in the mean-coherent gradient component G_mean, while the centered component G_ctr shows no comparable amplification (b). After the spike, Q/K gradients drop by roughly four orders of magnitude while W…
Figure 4. Writer amplification at the gradient spike (400-layer Base η run, t⋆ = 3400, measured on the T = 256 image-token segment). Each point plots A − 1 against the equal-magnitude absolute-coherence upper-envelope proxy (T − 1)κ̂ for (a) Attn_WO and (b) FFN_W2. Gray points are pre-spike layer-step samples; colored points are active layers at t⋆ (A − 1 > 0.5). The dashed line is the absolute-coherence saturation en…
Figure 5. Quality and optimizer stability over 80k steps (ImageNet 256×256). (Top) FID-50K and Inception Score. (Bottom) Post-clipping global gradient norm. The 400-layer curves define the controlled comparison: among the non-divergent 400-layer runs, MV-Split preserves a higher bounded gradient band than LayerScale while avoiding the spikes of the un-stabilized baselines. The 1000-layer MV-Split trace is included a…
Figure 6. Residual-writer gradient mode decomposition. Per-step median across depth of the mean-coherent (G_mean, left) and centered (G_ctr, right) writer-gradient magnitudes; shaded regions denote the interquartile range (IQR; 25–75% across depth). Token-isotropic per-channel gating compresses both modes; MV-Split bounds the mean-coherent component while preserving a higher, stable centered band. The convergence cur…
Figure 7. Standard-initialization control for a 128-layer DiT. (a) Token cosine similarity (TCS) over training steps and layer depth. The dashed white contour marks TCS = 0.9. (b) Depth profiles of centered retention Ret(c←c), centered branch replenishment VarGain, and attention row diversity RowDiv; curves report the median over diagnostic checkpoints from steps 10–690. (c) Median writer-gradient decomposition for …
Figure 8. Step-level gradient trace pipeline. A global-norm threshold (1) triggers a per-family top-K ranking of distributed gradient norms (2) and a cross-rank exclusion audit (3) that checks per-rank loss agreement, final-output-gradient RMS, and NaN/Inf in stored parameters. When all three exclusions pass, the dominant top-K parameter family at the detected step (4) is recorded for the gradient-mode audit and sub…
Figure 9. Step-level gradient trace at a representative spike (400 layers, Base η/2). (Left) Top parameter-family gradient norms G_{l,τ} at one detected step (Step 26423). The top-K entries span embedding/final parameters, Q/K/V projections, FFN input weights, and residual output projections. (Right) Per-rank loss across four snapshots (Steps 26423, 26430, 26434, 26437). The eight ranks stay tightly clustered (σ ∈ [0.0…
Figure 10. Attention-branch-only MV-Split control (1000 layers). (Left) Top parameter-family gradient norms at the detected step (Step 7415). With the attention output branch protected, no Attn_WO entries appear in the top-K; the largest printed entries are FFN_W2. (Right) Per-rank loss over four consecutive steps (7415–7419). Cross-rank losses stay tightly clustered (σ ∈ [0.011, 0.023]) while the loss rises uniform…
Figure 11. Real-image timestep linear probe. The backbone has no explicit timestep embedding or AdaLN modulation. (a) The trained image-token mean m_img predicts t with near-perfect linear R² across depth. The text-token mean m_txt becomes predictive after a few single-stream layers, indicating that the trained model routes image-derived timestep information into the text-token side. (b) Adding hidden-state summaries …
Figure 12. Full-horizon training loss for the MV-Split 400-layer and 1000-layer runs. Note that the SFT and DPO stages use a separately curated ∼50k image set rather than the ImageNet-2012 pre-training distribution; since loss values are data-dependent, the curves are shown for reference only.
Figure 13. Class “Alligator lizard” (044). Euler sampler, 35 NFE, CFG
Figure 14. Class “Scorpion” (071). Euler sampler, 35 NFE, CFG
Figure 15. Class “Jacamar” (095). Euler sampler, 35 NFE, CFG
Figure 16. Class “Rhodesian ridgeback” (159). Euler sampler, 35 NFE, CFG
Figure 17. Class “Bloodhound” (163). Euler sampler, 35 NFE, CFG
Figure 18. Class “Bouvier des Flandres” (233). Euler sampler, 35 NFE, CFG
Figure 19. Class “White wolf” (270). Euler sampler, 35 NFE, CFG
Figure 20. Class “Chimpanzee” (367). Euler sampler, 35 NFE, CFG
Figure 21. Class “Giant panda” (388). Euler sampler, 35 NFE, CFG
Figure 22. Class “Beaker” (438). Euler sampler, 35 NFE, CFG
Figure 23. Class “Caldron” (469). Euler sampler, 35 NFE, CFG
Figure 24. Class “Candle” (470). Euler sampler, 35 NFE, CFG
Figure 25. Class “Car wheel” (479). Euler sampler, 35 NFE, CFG
Figure 26. Class “Coffeepot” (505). Euler sampler, 35 NFE, CFG
Figure 27. Class “Convertible” (511). Euler sampler, 35 NFE, CFG
Figure 28. Class “Crock Pot” (521). Euler sampler, 35 NFE, CFG
Figure 29. Class “Drum” (541). Euler sampler, 35 NFE, CFG
Figure 30. Class “Envelope” (549). Euler sampler, 35 NFE, CFG
Figure 31. Class “Flute” (558). Euler sampler, 35 NFE, CFG
Figure 32. Class “Freight car” (565). Euler sampler, 35 NFE, CFG
Figure 33. Class “French horn” (566). Euler sampler, 35 NFE, CFG
Figure 34. Class “Greenhouse” (580). Euler sampler, 35 NFE, CFG
Figure 35. Class “Horse cart” (603). Euler sampler, 35 NFE, CFG
Figure 36. Class “Knot” (616). Euler sampler, 35 NFE, CFG
Figure 37. Class “Loupe” (633). Euler sampler, 35 NFE, CFG
Figure 38. Class “Mask” (643). Euler sampler, 35 NFE, CFG
Figure 39. Class “Minivan” (656). Euler sampler, 35 NFE, CFG
Figure 40. Class “Mitten” (658). Euler sampler, 35 NFE, CFG
Figure 41. Class “Monastery” (663). Euler sampler, 35 NFE, CFG
Figure 42. Class “Mountain bike” (671). Euler sampler, 35 NFE, CFG
Figure 43. Class “Pool table” (736). Euler sampler, 35 NFE, CFG
Figure 44. Class “Pot” (738). Euler sampler, 35 NFE, CFG
Figure 45. Class “Rugby ball” (768). Euler sampler, 35 NFE, CFG
Figure 46. Class “Scoreboard” (781). Euler sampler, 35 NFE, CFG
Figure 47. Class “Sweatshirt” (841). Euler sampler, 35 NFE, CFG
Figure 48. Class “Teapot” (849). Euler sampler, 35 NFE, CFG
Figure 49. Class “Trombone” (875). Euler sampler, 35 NFE, CFG
Figure 50. Class “Windsor tie” (906). Euler sampler, 35 NFE, CFG
Figure 51. Class “Alp” (970). Euler sampler, 35 NFE, CFG
Figure 52. Class “Groom” (982). Euler sampler, 35 NFE, CFG
Original abstract

Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.
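
The "mean-dominated" state described in the abstract can be monitored with the per-layer energy ratio quoted in the Figure 2 caption, the Frobenius norm of the token-mean part µ(X) divided by that of the centered part c(X). The sketch below is a minimal NumPy version under the assumption that µ(X) is the token mean broadcast back to every token and c(X) = X − µ(X). The paper's Appendix A may define these differently, and the helper names and synthetic inputs are illustrative.

```python
import numpy as np

def mean_part(x):
    """Token mean of a (T, d) matrix, broadcast back to shape (T, d)."""
    return np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)

def centered_part(x):
    """Per-token deviation from the token mean."""
    return x - x.mean(axis=0, keepdims=True)

def energy_ratio(x, eps=1e-12):
    """||mu(X)||_F / ||c(X)||_F; large values indicate mean domination."""
    return np.linalg.norm(mean_part(x)) / (np.linalg.norm(centered_part(x)) + eps)

rng = np.random.default_rng(0)
# Synthetic stand-ins for hidden states (not data from the paper).
healthy = rng.normal(size=(256, 64))   # tokens differ strongly from one another
collapsed = 0.01 * healthy + 5.0       # tokens nearly identical, large shared offset

print("energy ratio, healthy:  ", energy_ratio(healthy))
print("energy ratio, collapsed:", energy_ratio(collapsed))
```

Because the two parts are orthogonal, their squared norms add up to the squared norm of X, so the ratio describes how a fixed energy budget is divided between the shared mean and token-level variation.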

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies Mean Mode Screaming (MMS) as a collapse mechanism in deep Diffusion Transformers driven by mean-coherent gradient shocks and attention-logit suppression. It proposes Mean-Variance Split (MV-Split) residuals that separate centered updates from a leaky mean trunk. On 400-layer single-stream DiTs, MV-Split prevents the divergent collapse seen in the baseline while outperforming LayerScale; a single 1000-layer DiT run is presented as scale validation showing stable training at extreme depth.

Significance. If the causal role of MV-Split holds, the work could meaningfully aid scaling of DiT architectures by targeting a specific residual-stream instability. The 400-layer controlled comparisons and the gradient decomposition provide concrete empirical and mechanistic grounding; the 1000-layer run, while ambitious, remains preliminary.

major comments (2)
  1. [1000-layer scale-validation run (final paragraph of abstract and results)] The 1000-layer DiT scale-validation run reports only a single successful MV-Split training trajectory. No matched 1000-layer baseline without MV-Split (or with only other stabilizations) is shown, so the claim that MV-Split enables stable training at boundary scales rests on extrapolation from the 400-layer regime rather than direct isolation of the effect.
  2. [Mechanistic auditing and gradient decomposition section] The mechanistic decomposition of MMS gradients into mean-coherent and centered components is central to motivating MV-Split, yet the manuscript provides no quantitative verification (e.g., measured gradient norms or ablation of the null-space suppression) that this decomposition fully accounts for the observed collapse independent of other training dynamics.
minor comments (2)
  1. The precise definition and measurement of 'mean-coherent backward shock' and 'silent collapse' should be stated explicitly with equations or pseudocode so readers can reproduce the auditing procedure.
  2. Hyperparameter schedules, initialization details, and any additional stabilization tricks used in the 1000-layer run are not listed; adding them would improve reproducibility even if the primary contribution is the residual modification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below, indicating planned revisions where the manuscript can be strengthened without misrepresenting the presented results.

Point-by-point responses
  1. Referee: [1000-layer scale-validation run (final paragraph of abstract and results)] The 1000-layer DiT scale-validation run reports only a single successful MV-Split training trajectory. No matched 1000-layer baseline without MV-Split (or with only other stabilizations) is shown, so the claim that MV-Split enables stable training at boundary scales rests on extrapolation from the 400-layer regime rather than direct isolation of the effect.

    Authors: We acknowledge that the 1000-layer result is a single successful MV-Split trajectory with no matched baseline at that depth. The computational cost of 1000-layer DiT training makes controlled ablations at this scale impractical. The run is presented strictly as scale validation to demonstrate that stable training remains feasible at extreme depth once the 400-layer collapse is mitigated, rather than as direct causal isolation. We will revise the abstract and results to clarify this distinction and avoid any implication of a controlled comparison at 1000 layers. revision: partial

  2. Referee: [Mechanistic auditing and gradient decomposition section] The mechanistic decomposition of MMS gradients into mean-coherent and centered components is central to motivating MV-Split, yet the manuscript provides no quantitative verification (e.g., measured gradient norms or ablation of the null-space suppression) that this decomposition fully accounts for the observed collapse independent of other training dynamics.

    Authors: The decomposition into mean-coherent and centered gradient components follows directly from the residual-stream algebra and is exact under the stated assumptions. The 400-layer controlled experiments provide empirical corroboration by showing that isolating the centered component prevents collapse. We agree that explicit quantitative checks—such as measured norms of the mean versus centered gradient components and a targeted ablation of the Softmax null-space suppression—would strengthen the mechanistic section. These analyses will be added to the revised manuscript. revision: yes
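
As an illustration of how such a decomposition can be exact, consider a linear residual writer y = h W, whose weight gradient is hᵀ G with G the upstream gradient. Splitting both h and G into token-mean and centered parts makes the cross terms cancel identically, leaving one mean-coherent term and one centered term. The NumPy sketch below checks this under that assumption. It is not taken from the paper, whose definitions of the mean-coherent and centered components may differ, and the function names are hypothetical.

```python
import numpy as np

def token_mean(a):
    """Token mean of a (T, k) matrix, broadcast back to shape (T, k)."""
    return np.broadcast_to(a.mean(axis=0, keepdims=True), a.shape)

def writer_grad_modes(h, grad_y):
    """Split dL/dW = h.T @ grad_y for a linear writer y = h @ W.

    h:      (T, d_in)  writer inputs
    grad_y: (T, d_out) upstream gradient on the writer outputs
    Returns the mean-coherent and centered contributions to dL/dW.
    """
    h_mean, g_mean = token_mean(h), token_mean(grad_y)
    h_ctr, g_ctr = h - h_mean, grad_y - g_mean
    grad_mean = h_mean.T @ g_mean  # tokens pushing in a shared direction
    grad_ctr = h_ctr.T @ g_ctr     # token-specific variation
    return grad_mean, grad_ctr

rng = np.random.default_rng(0)
T, d_in, d_out = 32, 8, 5
h = rng.normal(size=(T, d_in)) + 1.0        # toy inputs with a nonzero mean
grad_y = rng.normal(size=(T, d_out)) + 0.5  # toy upstream gradient

grad_mean, grad_ctr = writer_grad_modes(h, grad_y)
full_grad = h.T @ grad_y

print("reconstruction error:", np.abs(full_grad - (grad_mean + grad_ctr)).max())
print("mean-coherent norm:  ", np.linalg.norm(grad_mean))
print("centered norm:       ", np.linalg.norm(grad_ctr))
```

In this setting the mean-coherent term equals T times the outer product of the two token means, so it grows with sequence length whenever tokens agree. Logging the two norms separately per layer is the kind of quantitative check the referee asks for.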

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal and scale validation

full rationale

The paper's core contribution is an empirical identification of Mean Mode Screaming via gradient auditing, followed by the MV-Split residual proposal and training runs at 400 and 1000 layers. No derivation chain exists that reduces a claimed prediction or first-principles result to quantities defined by the paper's own fitted parameters, self-citations, or ansatzes; the stability claim at extreme depth is presented as an observed outcome of the architectural change rather than a mathematically forced equivalence. The work is self-contained against external benchmarks of training stability and does not invoke load-bearing self-citations or uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable. The MV-Split is introduced as a new architectural component whose precise formulation and any associated constants are not detailed here.


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 2 internal anchors

  1. [1]

    Scaling laws for diffusion transformers, 2024

    Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers, 2024

  2. [2]

    Denoising diffusion probabilistic models, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020

  3. [3]

    Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021

  4. [4]

    Scalable diffusion models with transformers, 2023

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023

  5. [5]

    Cottrell, and Julian McAuley

    Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth, 2020

  6. [6]

    Going deeper with image transformers, 2021

    Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers, 2021

  7. [7]

    Gomez, Lukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017

  8. [8]

    On layer normalization in the transformer architecture, 2020

    Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture, 2020

  9. [9]

    Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

  10. [10]

    Kingma and Max Welling

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes, 2013

  11. [11]

    High-resolution image synthesis with latent diffusion models, 2022

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022

  12. [12]

    All are worth words: A vit backbone for diffusion models, 2022

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models, 2022

  13. [13]

    Scaling rectified flow transformers for high-resolution image synthesis, 2024

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024

  14. [14]

    Lumina-image 2.0: A unified and efficient image generative framework, 2025

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, and Peng Gao. Lumina-image 2.0: A unified and efficient image generative framework, 2025

  15. [15]

    Z-image: An efficient image generation foundation model with single-stream diffusion transformer, 2025

    Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-image: An efficient image generation foundation model with single-stream diffusion transformer, 2025

  16. [16]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  17. [17]

    Visionllama: A unified llama backbone for vision tasks, 2024

    Xiangxiang Chu, Jianlin Su, Bo Zhang, and Chunhua Shen. Visionllama: A unified llama backbone for vision tasks, 2024

  18. [18]

    Fit: Flexible vision transformer for diffusion model, 2024

    Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, and Lei Bai. Fit: Flexible vision transformer for diffusion model, 2024

  19. [19]

    Accurate, large minibatch sgd: Training imagenet in 1 hour, 2017

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour, 2017

  20. [20]

    Dauphin, and Tengyu Ma

    Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization, 2019

  21. [21]

    Adding conditional control to text-to-image diffusion models, 2023

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 10

  22. [22]

    Unveiling the secret of adaln-zero in diffusion transformer.https://openreview.net/forum?id=E4roJSM9RM, 2025

    Jie Zhu, Mingyu Ding, Boqiang Duan, Leye Wang, and Jingdong Wang. Unveiling the secret of adaln-zero in diffusion transformer.https://openreview.net/forum?id=E4roJSM9RM, 2025. ICLR 2025

  23. [23]

    Glu variants improve transformer, 2020

    Noam Shazeer. Glu variants improve transformer, 2020

  24. [24]

    Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

  25. [25]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2022

  26. [26]

    Berg, and Li Fei-Fei

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge, 2015

  27. [27]

    FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

  28. [28]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  29. [29]

    Diffusion models beat gans on image synthesis, 2021

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis, 2021

  30. [30]

    Is noise conditioning necessary for denoising generative models? InProceedings of the 42nd International Conference on Machine Learning, 2025

    Qiao Sun, Zhicheng Jiang, Hanhong Zhao, and Kaiming He. Is noise conditioning necessary for denoising generative models? InProceedings of the 42nd International Conference on Machine Learning, 2025

  31. [31]

    The geometry of noise: Why diffusion models don’t need noise conditioning, 2026

    Mojtaba Sahraee-Ardakan, Mauricio Delbracio, and Peyman Milanfar. The geometry of noise: Why diffusion models don’t need noise conditioning, 2026

  32. [32]

    Understanding the difficulty of training transformers, 2020

    Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. Understanding the difficulty of training transformers, 2020

  33. [33]

    Deepnet: Scaling transformers to 1,000 layers, 2022

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet: Scaling transformers to 1,000 layers, 2022

  34. [34]

    Post-layernorm is back: Stable, expressive, and deep, 2026

    Chen Chen and Lai Wei. Post-layernorm is back: Stable, expressive, and deep, 2026

  35. [35]

    Spike no more: Stabilizing the pre-training of large language models, 2023

    Sho Takase, Shun Kiyono, Sosuke Kobayashi, and Jun Suzuki. Spike no more: Stabilizing the pre-training of large language models, 2023. Published at COLM 2025

  36. [36]

    Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D

    Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Small-scale proxies for large-scale transformer training instabilities, 2023

  37. [37]

    Attention is not all you need: Pure attention loses rank doubly exponentially with depth, 2021

    Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth, 2021

  38. [38]

    Signal propagation in transformers: Theoretical perspectives and the role of rank collapse, 2022

    Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse, 2022

  39. [39]

    Stabilizing transformer training by preventing attention entropy collapse, 2023

    Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Josh Susskind. Stabilizing transformer training by preventing attention entropy collapse, 2023

  40. [40]

    Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019

  41. [41]

    Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao

    Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer, 2022

  42. [42]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 11

  43. [43]

    Training deep nets with sublinear memory cost, 2016

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost, 2016

  44. [44]

    Query-key normalization for transformers, 2020

    Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers, 2020

  45. [45]

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F....

  46. [46]

    Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022

  47. [47]

    Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, pages 10–19, New York, NY, USA, 2019. Association for Computing Machinery

  48. [48]

    Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free, 2025

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free, 2025

  49. [49]

    Efficient streaming language models with attention sinks, 2023

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2023

  50. [50]

    Muon: An optimizer for hidden layers in neural networks

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024

  51. [51]

    Muon is scalable for llm training, 2025

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

  52. [52]

    Direct preference optimization: Your language model is secretly a reward model, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2023

  53. [53]

    Diffusion model alignment using direct preference optimization, 2023

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023

  54. [54]

    Mamba: Linear-time sequence modeling with selective state spaces, 2023

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2023

  55. [55]

    Elucidating the design space of diffusion-based generative models, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022

  56. [56]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

A Diagnostic Metrics and Definitions

Table 2 provides the mathematical definitions for all diagnostic metrics referenced in our analysis. Spatially coherent metrics are estimated robustly on sampled token subsets during the live training pass. Table ...

Open-loop vs. leaky-integrator mean dynamics. Projecting the two merges into the mean subspace via $J$:

$$J Z^{\mathrm{LS}}_l = J X_l + (\lambda_l \odot J F_l), \qquad J Z^{\mathrm{MV}}_l = (1-\alpha) \odot J X_l + \alpha \odot J F_l. \tag{40}$$

LayerScale leaves the trunk's mean component untouched at every layer (the coefficient of $J X_l$ is identically 1): it does not damp the carried trunk mean and only scales newly injected branch ...
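The contrast in Eq. 40 can be illustrated with a toy recursion on a single channel. This is a minimal sketch, not the paper's code: it assumes the merged output of layer l becomes the trunk input of layer l+1, and it uses a constant branch mean purely for illustration. The function names `layerscale_mean` and `mv_split_mean` are hypothetical.

```python
def layerscale_mean(x0, branch_means, lam):
    """Open-loop recursion from Eq. 40: mean <- mean + lam * branch_mean."""
    m, trace = x0, [x0]
    for f in branch_means:
        m = m + lam * f  # trunk-mean coefficient is exactly 1, so nothing is damped
        trace.append(m)
    return trace


def mv_split_mean(x0, branch_means, alpha):
    """Leaky integrator from Eq. 40: mean <- (1 - alpha) * mean + alpha * branch_mean."""
    m, trace = x0, [x0]
    for f in branch_means:
        m = (1.0 - alpha) * m + alpha * f  # carried mean decays by (1 - alpha) per layer
        trace.append(m)
    return trace


# Toy setting (assumed): 1000 layers, every branch injects the same mean of 1.0.
depth = 1000
branch = [1.0] * depth
ls_trace = layerscale_mean(0.0, branch, lam=0.1)
mv_trace = mv_split_mean(0.0, branch, alpha=0.1)
print(ls_trace[-1], mv_trace[-1])
```

Under these assumptions the LayerScale mean accumulates linearly with depth, while the leaky-integrator mean stays bounded by the largest branch mean it has seen.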

Isotropic vs. anisotropic gain on the residual branch. By Eq. 5 the gradient decomposes as $\nabla_W \mathcal{L} = \Delta W_\mu + \Delta W_c$, with $\|\Delta W_\mu\|_F \sim T\hat{\kappa}$ in the coherent regime and $\|\Delta W_c\|_F$ scaling diffusively under weak centered alignment. In the scalar-gain simplification, both modes are scaled by the same gain:

$$\frac{\|\Delta W^{\mathrm{LS}}_\mu\|_F}{\|\Delta W^{\mathrm{LS}}_c\|_F} \propto \sqrt{T}\,\hat{\kappa}. \tag{41}$$

For scalar gates like ReZero, the rat...
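The scaling in Eq. 41 follows in one line once "diffusive" is read as a $\sqrt{T}$ growth of the centered term. That reading, and the shared scalar gain $g$, are assumptions made here for illustration; they are not stated in this form in the text above.

```latex
% Assumed: \|\Delta W_c\|_F \sim \sqrt{T} (diffusive), and one scalar gain g on both modes.
\frac{\|\Delta W^{\mathrm{LS}}_\mu\|_F}{\|\Delta W^{\mathrm{LS}}_c\|_F}
  = \frac{g\,\|\Delta W_\mu\|_F}{g\,\|\Delta W_c\|_F}
  \sim \frac{T\hat{\kappa}}{\sqrt{T}}
  = \sqrt{T}\,\hat{\kappa}
```

The gain $g$ cancels, so a shared scalar gain cannot change the ratio between the mean-coherent and centered gradient modes.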

Independent gain on the centered path. Whatever absolute gain on the centered branch is needed for stability at a given depth, MV-Split treats it as a free parameter set independently of α (Eq. 9). LayerScale ties the two paths to the same token-independent per-channel gain, so any reduction in the mean-coherent contribution unavoidably reduces centered re...
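The decoupling described above can be sketched on a tiny tokens-by-channels array. Eq. 9 is not reproduced in this section, so the centered update below (trunk centered part plus a gain `beta` times the branch centered part) is an assumed form, and `beta`, `mv_split_merge`, and `layerscale_merge` are hypothetical names. Scalar gains stand in for the per-channel gains of the text.

```python
def token_mean(x):
    """Per-channel mean over tokens; x is a list of token rows."""
    n = len(x)
    return [sum(col) / n for col in zip(*x)]


def mv_split_merge(x, f, alpha, beta):
    """Assumed MV-Split merge: leaky mean replacement plus separately gained centered update."""
    mx, mf = token_mean(x), token_mean(f)
    out = []
    for xr, fr in zip(x, f):
        row = []
        for c, (xv, fv) in enumerate(zip(xr, fr)):
            mean_part = (1.0 - alpha) * mx[c] + alpha * mf[c]   # mean path, as in Eq. 40
            centered_part = (xv - mx[c]) + beta * (fv - mf[c])  # assumed centered path
            row.append(mean_part + centered_part)
        out.append(row)
    return out


def layerscale_merge(x, f, lam):
    """LayerScale-style merge: one gain scales mean and centered parts together."""
    return [[xv + lam * fv for xv, fv in zip(xr, fr)] for xr, fr in zip(x, f)]


x = [[1.0, 2.0], [3.0, 6.0]]    # trunk tokens (toy values)
f = [[10.0, 0.0], [14.0, 4.0]]  # branch output (toy values)
z_a = mv_split_merge(x, f, alpha=0.5, beta=2.0)
z_b = mv_split_merge(x, f, alpha=0.1, beta=2.0)
```

In this sketch, changing `alpha` moves only the token mean of the output, while the centered component is controlled by `beta` alone. With `layerscale_merge`, lowering `lam` shrinks both at once.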

Figure 8: Step-level gradient trace pipeline. [Diagram stages: (1) Spike detection: grad norm > adaptive threshold. (2) Per-layer autopsy: top-K params by grad Frobenius norm. (3) Root cause exclusion: cross-rank context audit; excluded hypotheses: per-rank loss normal, input grad RMS normal, no NaN/Inf in params; all clear → internal mechanism. (4) Type classification: dominant param → failure mode (Attn_WO / FFN_W2 / Norm).] A global-norm threshold (1) triggers a per-family top-K ranking of distributed gradient norms (2) and a cross-rank exclusion audit (3) that checks per-rank loss agreement, final-output-gradient RMS, and NaN/Inf in stored parameters. When all...
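The first stages of the Figure 8 pipeline can be sketched on a single process as follows. This is an illustrative sketch only: the source does not specify the adaptive threshold rule, so the median-based rule, the `window`, `factor`, and `warmup` parameters, and all function and class names here are assumptions. Gradients are plain nested lists to keep the example dependency-free.

```python
import math
from collections import deque
from statistics import median


def frobenius_norm(grad):
    """Frobenius norm of a gradient given as a list of rows."""
    return math.sqrt(sum(v * v for row in grad for v in row))


def global_grad_norm(grads):
    """Global norm over a {param_name: gradient} mapping."""
    return math.sqrt(sum(frobenius_norm(g) ** 2 for g in grads.values()))


class SpikeDetector:
    """Flags a step when its norm exceeds factor * running median (assumed rule)."""

    def __init__(self, window=50, factor=3.0, warmup=5):
        self.history = deque(maxlen=window)
        self.factor = factor
        self.warmup = warmup

    def update(self, norm):
        is_spike = (
            len(self.history) >= self.warmup
            and norm > self.factor * median(self.history)
        )
        if not is_spike:
            self.history.append(norm)  # keep spikes out of the baseline
        return is_spike


def top_k_params(grads, k=3):
    """Names of the k parameters with the largest gradient Frobenius norm."""
    ranked = sorted(grads, key=lambda name: frobenius_norm(grads[name]), reverse=True)
    return ranked[:k]


def all_finite(params):
    """Exclusion check: no NaN/Inf in any stored tensor."""
    return all(math.isfinite(v) for p in params.values() for row in p for v in row)
```

In use, `SpikeDetector.update(global_grad_norm(grads))` would gate the more expensive `top_k_params` ranking and the `all_finite` check, so the per-parameter work runs only on flagged steps.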