Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
Pith reviewed 2026-05-08 13:32 UTC · model grok-4.3
The pith
Mean-Variance Split residuals stop mean-mode collapse and let Diffusion Transformers train stably at 1000 layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Very deep DiTs can enter a silent, mean-dominated collapse state. Its trigger, Mean Mode Screaming (MMS), is a mean-coherent backward shock on residual writers. The paper attributes it to an exact decomposition of the writer gradients into mean-coherent and centered components, compounded by structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. Mean-Variance Split residuals counteract the collapse by combining a separately gained centered residual update with a leaky trunk-mean replacement: a 400-layer DiT avoids the divergent failure of the baseline, and a 1000-layer DiT remains stably trainable.
What carries the argument
Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement to preserve both mean and variance dynamics.
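As a rough illustration of what such a merge might look like, the sketch below splits a trunk state `X` and a branch output `F` (both T tokens by d channels) into a token mean and a centered part. The mean path follows the mean-subspace form given in the paper's Eq. 40, (1 - alpha) * mean(X) + alpha * mean(F). The centered path, `X_c + beta * F_c`, is an assumption: the paper's exact rule (its Eq. 9) is not reproduced on this page, and the names `mv_split_merge`, `alpha`, and `beta` are hypothetical.

```python
def token_mean(M):
    """Per-channel mean over tokens for a T x d matrix stored as a list of rows."""
    T = len(M)
    return [sum(row[j] for row in M) / T for j in range(len(M[0]))]


def mv_split_merge(X, F, alpha, beta):
    """Hypothetical MV-Split merge (not the authors' implementation).

    mean path:     (1 - alpha) * mean(X) + alpha * mean(F)  (leaky trunk-mean replacement)
    centered path: (X - mean(X)) + beta * (F - mean(F))     (assumed form, separate gain)
    """
    mx, mf = token_mean(X), token_mean(F)
    d = len(mx)
    new_mean = [(1 - alpha[j]) * mx[j] + alpha[j] * mf[j] for j in range(d)]
    return [
        [new_mean[j] + (x[j] - mx[j]) + beta[j] * (f[j] - mf[j]) for j in range(d)]
        for x, f in zip(X, F)
    ]


# Toy inputs: 2 tokens, 2 channels. All values are illustrative.
X = [[1.0, 2.0], [3.0, 6.0]]
F = [[10.0, 0.0], [14.0, 4.0]]
alpha = [0.5, 0.25]   # per-channel leak on the mean path
beta = [0.1, 0.1]     # per-channel gain on the centered path
Z = mv_split_merge(X, F, alpha, beta)
mean_Z = token_mean(Z)
```

In this form the output mean depends only on `alpha` and the centered variation only on `beta`, which is the independence between the two paths that the summary describes.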
If this is right
- A 400-layer single-stream DiT avoids the divergent collapse that crashes the baseline.
- The stabilized model tracks the baseline trajectory up to the crash point while outperforming token-isotropic methods like LayerScale over the full schedule.
- A 1000-layer DiT remains stably trainable in a single scale-validation run, indicating that the architecture works at boundary scales.
- MMS can be detected and blocked even when training initially appears stable.
Where Pith is reading between the lines
- The same split-residual pattern could be tested in other residual-heavy generative architectures to see whether mean-mode collapse is a general depth limit.
- Scaling studies for diffusion models could now treat extreme depth as a controllable variable once MMS is mitigated.
- If MV-Split preserves variance flow, it may also improve sample diversity in very deep models compared with mean-only stabilizations.
Load-bearing premise
The stability seen in the 1000-layer run is produced by the MV-Split residuals rather than other unstated training choices or baseline stabilizations.
What would settle it
Train the identical 1000-layer DiT without MV-Split residuals and check whether it enters the same mean-dominated collapse and crashes as the unstabilized baseline.
Original abstract
Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.
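The null-space argument can be checked numerically on a single attention query. The sketch below is an illustration under simple assumptions (one query, output o = sum_j p_j v_j, hypothetical function names and toy values), not the paper's auditing code. The softmax Jacobian is diag(p) - p p^T, which maps any constant vector to zero. When all value vectors are identical, dL/dp_j is the same for every key j, so the gradient reaching the logits vanishes.

```python
import math


def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]


def logit_grad(p, g):
    """Apply the softmax Jacobian (diag(p) - p p^T) to an upstream gradient g = dL/dp."""
    dot = sum(pi * gi for pi, gi in zip(p, g))
    return [pi * (gi - dot) for pi, gi in zip(p, g)]


def attn_logit_grad(scores, values, grad_out):
    """Gradient w.r.t. one query's attention logits for o = sum_j p_j * v_j."""
    p = softmax(scores)
    # dL/dp_j is the inner product of value j with the output gradient.
    g = [sum(vk * gk for vk, gk in zip(v, grad_out)) for v in values]
    return logit_grad(p, g)


scores = [0.3, -1.2, 2.0]
grad_out = [1.0, -0.5]
diverse = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # distinct value vectors
homogenized = [[0.7, -0.4]] * 3                   # identical value vectors

grad_diverse = attn_logit_grad(scores, diverse, grad_out)
grad_homog = attn_logit_grad(scores, homogenized, grad_out)
```

With distinct values the logits receive a nonzero gradient; with homogenized values it is zero up to floating-point error, so the attention pattern can no longer be corrected through the logits.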
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies Mean Mode Screaming (MMS) as a collapse mechanism in deep Diffusion Transformers driven by mean-coherent gradient shocks and attention-logit suppression. It proposes Mean-Variance Split (MV-Split) residuals that separate centered updates from a leaky mean trunk. On 400-layer single-stream DiTs, MV-Split prevents the divergent collapse seen in the baseline while outperforming LayerScale; a single 1000-layer DiT run is presented as scale validation showing stable training at extreme depth.
Significance. If the causal role of MV-Split holds, the work could meaningfully aid scaling of DiT architectures by targeting a specific residual-stream instability. The 400-layer controlled comparisons and the gradient decomposition provide concrete empirical and mechanistic grounding; the 1000-layer run, while ambitious, remains preliminary.
major comments (2)
- [1000-layer scale-validation run (final paragraph of abstract and results)] The 1000-layer DiT scale-validation run reports only a single successful MV-Split training trajectory. No matched 1000-layer baseline without MV-Split (or with only other stabilizations) is shown, so the claim that MV-Split enables stable training at boundary scales rests on extrapolation from the 400-layer regime rather than direct isolation of the effect.
- [Mechanistic auditing and gradient decomposition section] The mechanistic decomposition of MMS gradients into mean-coherent and centered components is central to motivating MV-Split, yet the manuscript provides no quantitative verification (e.g., measured gradient norms or ablation of the null-space suppression) that this decomposition fully accounts for the observed collapse independent of other training dynamics.
minor comments (2)
- The precise definition and measurement of 'mean-coherent backward shock' and 'silent collapse' should be stated explicitly with equations or pseudocode so readers can reproduce the auditing procedure.
- Hyperparameter schedules, initialization details, and any additional stabilization tricks used in the 1000-layer run are not listed; adding them would improve reproducibility even if the primary contribution is the residual modification.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below, indicating planned revisions where the manuscript can be strengthened without misrepresenting the presented results.
Point-by-point responses
- Referee: [1000-layer scale-validation run (final paragraph of abstract and results)] The 1000-layer DiT scale-validation run reports only a single successful MV-Split training trajectory. No matched 1000-layer baseline without MV-Split (or with only other stabilizations) is shown, so the claim that MV-Split enables stable training at boundary scales rests on extrapolation from the 400-layer regime rather than direct isolation of the effect.
  Authors: We acknowledge that the 1000-layer result is a single successful MV-Split trajectory with no matched baseline at that depth. The computational cost of 1000-layer DiT training makes controlled ablations at this scale impractical. The run is presented strictly as scale validation to demonstrate that stable training remains feasible at extreme depth once the 400-layer collapse is mitigated, rather than as direct causal isolation. We will revise the abstract and results to clarify this distinction and avoid any implication of a controlled comparison at 1000 layers.
  Revision: partial
- Referee: [Mechanistic auditing and gradient decomposition section] The mechanistic decomposition of MMS gradients into mean-coherent and centered components is central to motivating MV-Split, yet the manuscript provides no quantitative verification (e.g., measured gradient norms or ablation of the null-space suppression) that this decomposition fully accounts for the observed collapse independent of other training dynamics.
  Authors: The decomposition into mean-coherent and centered gradient components follows directly from the residual-stream algebra and is exact under the stated assumptions. The 400-layer controlled experiments provide empirical corroboration by showing that isolating the centered component prevents collapse. We agree that explicit quantitative checks, such as measured norms of the mean versus centered gradient components and a targeted ablation of the Softmax null-space suppression, would strengthen the mechanistic section. These analyses will be added to the revised manuscript.
  Revision: yes
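The norm check the authors promise could look roughly like the sketch below. It assumes a linear residual writer Y = X W, so the weight gradient is the sum over tokens of x_t g_t^T, with `X` the writer input and `G` the upstream gradient (both T rows). Splitting each into a token mean and a centered part gives an exact two-term decomposition because the cross terms sum to zero. Whether this matches the paper's Eq. 5 term for term is an assumption, and all function names are hypothetical.

```python
def col_mean(M):
    T = len(M)
    return [sum(r[j] for r in M) / T for j in range(len(M[0]))]


def outer_sum(A, B):
    """sum_t a_t b_t^T for T x m and T x n row lists, returned as an m x n matrix."""
    m, n = len(A[0]), len(B[0])
    return [[sum(a[i] * b[j] for a, b in zip(A, B)) for j in range(n)] for i in range(m)]


def split_writer_grad(X, G):
    """Return (mean-coherent part, centered part) of the weight gradient X^T G."""
    T = len(X)
    xm, gm = col_mean(X), col_mean(G)
    Xc = [[v - m for v, m in zip(r, xm)] for r in X]
    Gc = [[v - m for v, m in zip(r, gm)] for r in G]
    dW_mu = [[T * xi * gj for gj in gm] for xi in xm]   # T * mean(x) mean(g)^T
    dW_c = outer_sum(Xc, Gc)                            # sum of centered outer products
    return dW_mu, dW_c


def fro(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for r in M for v in r) ** 0.5


# Toy data: 3 tokens, 2 input channels, 2 output channels.
X = [[1.0, 2.0], [1.5, 1.0], [0.5, 3.0]]
G = [[0.2, -0.1], [0.3, 0.1], [0.1, -0.3]]
dW_mu, dW_c = split_writer_grad(X, G)
full = outer_sum(X, G)
residual = fro([[full[i][j] - dW_mu[i][j] - dW_c[i][j] for j in range(2)] for i in range(2)])
mean_norm, centered_norm = fro(dW_mu), fro(dW_c)
```

Logging `mean_norm` and `centered_norm` per writer over training steps is one way to make the mean-versus-centered comparison quantitative; `residual` should stay at floating-point noise if the split is implemented correctly.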
Circularity Check
No circularity: empirical architecture proposal and scale validation
The paper's core contribution is an empirical identification of Mean Mode Screaming via gradient auditing, followed by the MV-Split residual proposal and training runs at 400 and 1000 layers. No derivation chain exists that reduces a claimed prediction or first-principles result to quantities defined by the paper's own fitted parameters, self-citations, or ansatzes; the stability claim at extreme depth is presented as an observed outcome of the architectural change rather than a mathematically forced equivalence. The work is self-contained against external benchmarks of training stability and does not invoke load-bearing self-citations or uniqueness theorems.
A. Diagnostic Metrics and Definitions
Table 2 provides the mathematical definitions for all diagnostic metrics referenced in our analysis. Spatially coherent metrics are estimated robustly on sampled token subsets during the live training pass. Table ...
- Open-loop vs. leaky-integrator mean dynamics. Projecting the two merges into the mean subspace via $J$:
  $$J Z^{\mathrm{LS}}_l = J X_l + (\lambda_l \odot J F_l), \qquad J Z^{\mathrm{MV}}_l = (1-\alpha) \odot J X_l + \alpha \odot J F_l. \tag{40}$$
  LayerScale leaves the trunk's mean component untouched at every layer (the coefficient of $J X_l$ is identically 1): it does not damp the carried trunk mean and only scales newly injected branch ...
- Isotropic vs. anisotropic gain on the residual branch. By Eq. 5 the gradient decomposes as $\nabla_W L = \Delta W_\mu + \Delta W_c$, with $\|\Delta W_\mu\|_F \sim T\hat{\kappa}$ in the coherent regime and $\|\Delta W_c\|_F$ scaling diffusively under weak centered alignment. In the scalar-gain simplification, both modes are scaled by the same gain:
  $$\frac{\|\Delta W^{\mathrm{LS}}_\mu\|_F}{\|\Delta W^{\mathrm{LS}}_c\|_F} \propto \sqrt{T}\,\hat{\kappa}. \tag{41}$$
  For scalar gates like ReZero, the rat...
- Independent gain on the centered path. Whatever absolute gain on the centered branch is needed for stability at a given depth, MV-Split treats it as a free parameter set independently of $\alpha$ (Eq. 9). LayerScale ties the two paths to the same token-independent per-channel gain, so any reduction in the mean-coherent contribution unavoidably reduces centered re...
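The contrast between the two mean recursions in Eq. 40 can be seen in a toy scalar model. The sketch below assumes a single channel, a constant branch mean `f` at every layer, and constant gains; these are simplifying assumptions made here for illustration, and the gain value 0.01 is hypothetical rather than taken from the paper.

```python
def layerscale_mean(m0, f, lam, depth):
    """Open-loop recursion: the carried mean keeps coefficient 1 at every layer."""
    m = m0
    for _ in range(depth):
        m = m + lam * f
    return m


def mv_split_mean(m0, f, alpha, depth):
    """Leaky integrator: the carried mean is damped by (1 - alpha) at every layer."""
    m = m0
    for _ in range(depth):
        m = (1 - alpha) * m + alpha * f
    return m


# Same per-layer injection strength for both rules, 400 layers deep.
ls_400 = layerscale_mean(0.0, 1.0, 0.01, 400)
mv_400 = mv_split_mean(0.0, 1.0, 0.01, 400)
```

Under these assumptions the open-loop mean grows linearly with depth (to about 4.0 here), while the leaky mean stays bounded by the branch mean and approaches it (about 0.98 here). Real branch means vary by layer, so this only illustrates the structural difference.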
Figure 8: Step-level gradient trace pipeline.
(1) Spike detection: grad norm > adaptive threshold.
(2) Per-layer autopsy: top-K params by grad Frobenius norm.
(3) Root cause exclusion: cross-rank context audit. Excluded hypotheses: per-rank loss normal, input grad RMS normal, no NaN/Inf in params. All clear → internal mechanism.
(4) Type classification: dominant param → failure mode (Attn_WO / FFN_W2 / Norm).
A global-norm threshold (1) triggers a per-family top-K ranking of distributed gradient norms (2) and a cross-rank exclusion audit (3) that checks per-rank loss agreement, final-output-gradient RMS, and NaN/Inf in stored parameters. When all...
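The four stages of this pipeline could be wired together along the following lines. This is a sketch under stated assumptions, not the authors' tooling: the adaptive threshold is taken to be a multiple of the median of recent global gradient norms, the audit tolerances (`loss_tol`, `rms_max`) are invented placeholders, and every function and parameter name is hypothetical. Only the stage order and the family labels (Attn_WO, FFN_W2, Norm) come from the figure.

```python
from statistics import median


def is_spike(history, current, factor=3.0):
    """Stage 1 (assumed rule): current global grad norm exceeds factor * recent median."""
    return bool(history) and current > factor * median(history)


def top_k(grad_norms, k=3):
    """Stage 2: rank parameter names by gradient Frobenius norm, largest first."""
    return sorted(grad_norms.items(), key=lambda kv: kv[1], reverse=True)[:k]


def external_causes_excluded(rank_losses, out_grad_rms, params_finite,
                             loss_tol=0.1, rms_max=10.0):
    """Stage 3: per-rank losses agree, output-grad RMS is normal, no NaN/Inf in params."""
    spread = max(rank_losses) - min(rank_losses)
    return spread <= loss_tol and out_grad_rms <= rms_max and params_finite


def classify(param_name):
    """Stage 4: map the dominant parameter to a failure-mode family."""
    for family in ("attn_wo", "ffn_w2", "norm"):
        if family in param_name.lower():
            return family
    return "other"


def trace_step(history, global_norm, grad_norms, rank_losses, out_grad_rms, params_finite):
    if not is_spike(history, global_norm):
        return None
    ranked = top_k(grad_norms)
    if not external_causes_excluded(rank_losses, out_grad_rms, params_finite):
        return {"top": ranked, "verdict": "external cause not excluded"}
    return {"top": ranked, "verdict": classify(ranked[0][0])}


# Illustrative inputs only; parameter names and numbers are made up.
history = [1.0, 1.1, 0.9, 1.0]
grad_norms = {"layer312.attn_wo": 4.2, "layer312.ffn_w2": 1.1,
              "layer10.norm": 0.2, "layer5.attn_qkv": 0.1}
report = trace_step(history, 5.0, grad_norms, [0.52, 0.50, 0.51], 1.3, True)
quiet = trace_step(history, 1.2, grad_norms, [0.52, 0.50, 0.51], 1.3, True)
```

Here `report` flags the attention output projection as the dominant writer, while `quiet` is `None` because the global norm never crosses the threshold.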