A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
Pith reviewed 2026-05-20 14:15 UTC · model grok-4.3
The pith
Transformer FFN and output-projection weights converge to a narrow band of the Weibull shape parameter k, around 1.19, across diverse architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Weibull shape parameter k labels the functional class of a weight matrix. The Transmission Class (FFN modules and W_o) stabilizes in a narrow band, with median terminal k in [1.186, 1.204] (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B model sizes. The Selection Class (W_q, W_k) departs from the Weibull family, with severity modulated by storage format. The scale lambda grows substantially during training and, within the Pythia family, scales with sqrt(eta/lambda_wd).
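The cross-family CV quoted above is a coefficient of variation over per-entry median k values. The sketch below shows one way to compute it; the helper name `coefficient_of_variation`, the choice of the sample (n-1) standard deviation, and the input values are all assumptions for illustration, and the numbers are placeholders rather than the paper's data.

```python
import statistics

def coefficient_of_variation(values):
    """Return (sample standard deviation / mean) as a percentage."""
    mean = statistics.fmean(values)
    return 100.0 * statistics.stdev(values) / mean

# Hypothetical per-entry median terminal k values, NOT taken from the paper.
median_k = [1.188, 1.192, 1.195, 1.199, 1.203]
cv = coefficient_of_variation(median_k)
print(f"CV = {cv:.2f}%")
```

Whether the paper uses the sample or the population standard deviation is not stated in the abstract; with only 12 entries the two differ by a few percent of the CV itself.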
What carries the argument
The two-parameter Weibull distribution applied to absolute weight magnitudes, with shape k serving as a functional-class label and scale lambda as a training-progress indicator, fitted independently per matrix using middle-80% probability plots.
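A minimal sketch of such a fit follows, assuming a least-squares line through the Weibull probability plot restricted to the central 80% of probability mass. The function name `fit_weibull_middle`, the Hazen plotting positions (i - 0.5)/n, and the Gaussian test matrix are assumptions; the paper's released library may implement the protocol differently.

```python
import numpy as np

def fit_weibull_middle(weights, window=0.80):
    """Fit Weibull (k, lam) to |weights| by least squares on a probability plot.

    Only plotting positions inside the central `window` of probability mass
    are used, e.g. F in [0.10, 0.90] for window=0.80.
    """
    x = np.sort(np.abs(np.asarray(weights, dtype=np.float64)).ravel())
    n = x.size
    # Hazen plotting positions: an assumed choice, not confirmed by the source.
    F = (np.arange(1, n + 1) - 0.5) / n
    lo, hi = (1.0 - window) / 2.0, (1.0 + window) / 2.0
    keep = (F >= lo) & (F <= hi) & (x > 0)
    u = np.log(x[keep])                  # ln|w|
    y = np.log(-np.log1p(-F[keep]))      # ln(-ln(1 - F))
    k, intercept = np.polyfit(u, y, 1)   # line: y = k*u - k*ln(lam)
    lam = np.exp(-intercept / k)
    return float(k), float(lam)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(512, 512))  # hypothetical Gaussian-initialized matrix
k, lam = fit_weibull_middle(w)
print(f"k = {k:.3f}, lambda = {lam:.4f}")
```

Rescaling the weights by a constant shifts ln|w| by a constant, so in this sketch k is unchanged and lambda scales by the same factor. That is the sense in which the two parameters can carry separate information.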
If this is right
- Weight matrices can be classified into Transmission or Selection roles based solely on their fitted k value after training.
- The framework enables per-layer and per-step monitoring of training dynamics through changes in lambda.
- Architectural decisions such as grouped-query attention versus separate Q/K storage directly influence how far selection weights depart from the Weibull family.
- k remains stable for transmission components even as model scale increases from 70M to 14B parameters.
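To illustrate the first bullet, the sketch below maps a fitted k to a class label by reusing the k ranges quoted in the abstract as thresholds. The band names, the `label_from_k` helper, and the use of reported ranges as hard cut-offs are assumptions for illustration; the paper reports these ranges as observed medians, not as a decision rule.

```python
# k ranges quoted in the abstract, used here as illustrative thresholds only.
BANDS = [
    ("selection-deep", 0.76, 0.99),   # separately stored Q/K (OLMo-1, OLMo-2)
    ("selection-mild", 1.10, 1.16),   # GQA models
    ("transmission", 1.186, 1.204),   # FFN modules and W_o
]

def label_from_k(k):
    """Return the first band containing k, or 'unclassified' if none does."""
    for name, lo, hi in BANDS:
        if lo <= k <= hi:
            return name
    return "unclassified"

for k in (0.85, 1.12, 1.19, 1.05):
    print(k, label_from_k(k))
```

Values that fall between bands, such as Pythia's merged W_qkv in its transitional zone, come back as unclassified here. A single-matrix k may also sit outside a band that was reported for per-model medians.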
Where Pith is reading between the lines
- The narrow convergence of k in transmission weights may reflect an underlying requirement for stable magnitude distributions to support reliable information flow through the network.
- This diagnostic could be applied during training to detect when a component begins to deviate from its expected functional class.
- Similar Weibull analysis might reveal analogous class distinctions in non-transformer architectures such as convolutional networks.
Load-bearing premise
The middle-80% probability-plot fitting protocol yields a k value that is stable and reflects the true functional class of the weight matrix rather than depending on the specific fitting window or preprocessing steps chosen.
What would settle it
If a newly trained transformer from an unseen family ends training with a median k for its FFN and W_o matrices outside the interval [1.186, 1.204], this would indicate that the narrow band is not universal across all architectures.
Original abstract
We apply the Weibull distribution -- a two-parameter family from extreme-value theory -- as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o -- the Transmission Class -- fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B sizes. Second, the attention input projections W_q, W_k -- the Selection Class -- depart from the Weibull family, with severity shaped by storage: separately-stored Q/K (OLMo-1, OLMo-2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia's merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm-weibull-py v0.4 (Python library) and DATABASE_v9_1 at https://github.com/tiexinding/NPM-Weibull-public .
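The third finding is a correlation between the fitted scale lambda and sqrt(eta/lambda_wd). The sketch below shows how such a Pearson coefficient could be computed. The helper `scaling_correlation` and every number in it are hypothetical: the lambda values are generated to follow the scaling law exactly, so the result only demonstrates the calculation and says nothing about the paper's reported r = 0.94.

```python
import numpy as np

def scaling_correlation(lr, weight_decay, lam):
    """Pearson r between fitted lambda and sqrt(lr / weight_decay)."""
    lr = np.asarray(lr, dtype=float)
    weight_decay = np.asarray(weight_decay, dtype=float)
    predictor = np.sqrt(lr / weight_decay)
    return float(np.corrcoef(predictor, np.asarray(lam, dtype=float))[0, 1])

# Synthetic placeholder hyperparameters, NOT the Pythia training configs.
lr = np.array([1e-3, 6e-4, 3e-4, 2e-4, 1.2e-4])
wd = np.array([0.01, 0.01, 0.1, 0.1, 0.1])
lam = 0.05 * np.sqrt(lr / wd)   # constructed to obey the law exactly
r = scaling_correlation(lr, wd, lam)
print(f"Pearson r = {r:.3f}")
```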
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-parameter Weibull framework for diagnosing element-wise weight magnitude distributions in transformers. It anchors k ≈ 1.20 at initialization via middle-80% probability-plot fit to the HalfNormal that arises from i.i.d. Gaussian weights, then uses per-matrix fits across training to classify components into Transmission (FFN and W_o, narrow terminal k band [1.186, 1.204], CV 0.51% across 12 models) and Selection (W_q/W_k, k varying by storage and architecture) classes, while showing lambda scaling with sqrt(eta/lambda_wd). Code and database are released.
Significance. If the reported k stability holds under scrutiny, the framework supplies an architecture-independent, per-component diagnostic that separates functional roles from training progress and could complement aggregate statistics for monitoring convergence. The public library and dataset are concrete strengths for reproducibility.
Major comments (2)
- Abstract: the central claim of a narrow, architecture-independent k band for the Transmission Class rests on the middle-80% probability-plot fitting protocol, yet no sensitivity analysis to window width (70% vs. 90%), tail truncation, or preprocessing is reported; this leaves open whether the [1.186, 1.204] interval and 0.51% CV are partly produced by the chosen window rather than genuine convergence.
- Abstract: functional classes are defined by the observed ranges of the fitted k values themselves, so the reported 'findings' function as descriptive summaries of the fits rather than independent predictions; this circularity reduces the framework's ability to serve as a measuring stick for unseen architectures or training regimes.
Minor comments (2)
- Abstract: the 12 model entries, exact checkpoints, data exclusion rules, and raw fit statistics (R², Kolmogorov-Smirnov, etc.) are not enumerated, making it impossible to assess whether post-hoc choices affect the reported bands.
- Abstract: error bars or per-model variability measures are absent from the median k and CV statements; adding them would strengthen the cross-family consistency claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below. We agree that sensitivity analysis is needed to confirm robustness of the k band and will add it in revision. We clarify the a priori architectural motivation for the classes to address the circularity concern.
Point-by-point responses
- Referee: Abstract: the central claim of a narrow, architecture-independent k band for the Transmission Class rests on the middle-80% probability-plot fitting protocol, yet no sensitivity analysis to window width (70% vs. 90%), tail truncation, or preprocessing is reported; this leaves open whether the [1.186, 1.204] interval and 0.51% CV are partly produced by the chosen window rather than genuine convergence.
  Authors: We agree that the manuscript lacks a sensitivity analysis for the fitting protocol. In the revised version we will report results from re-fitting the Transmission-class matrices using windows of 70%, 80%, and 90%, together with checks on tail truncation and standard preprocessing variants. These additional experiments will quantify any variation in the reported terminal-k interval and CV, allowing readers to assess whether the observed stability is robust to reasonable changes in the protocol.
  Revision: yes
- Referee: Abstract: functional classes are defined by the observed ranges of the fitted k values themselves, so the reported 'findings' function as descriptive summaries of the fits rather than independent predictions; this circularity reduces the framework's ability to serve as a measuring stick for unseen architectures or training regimes.
  Authors: The Transmission and Selection classes are motivated by the distinct functional roles of the matrices inside the transformer: Transmission matrices (FFN and W_o) propagate transformed activations, while Selection matrices (W_q and W_k) compute attention scores. The Weibull shape parameter k is then used as an empirical diagnostic that consistently separates these roles across twelve models from seven families. We will revise the abstract and introduction to state this architectural motivation first, thereby framing the narrow k band as confirmatory evidence rather than the sole definition of the classes, and clarifying the framework's intended use for new architectures.
  Revision: partial
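The window-width check discussed in the first response can be sketched without any trained model, using the exact quantiles of a half-normal (the initialization case from the abstract). The function `halfnormal_k`, the evenly spaced probability grid, and the least-squares fit are assumptions about the protocol, not the authors' implementation.

```python
from statistics import NormalDist
import numpy as np

def halfnormal_k(window, n=2001):
    """Probability-plot slope k for an exact half-normal on a central window."""
    lo, hi = (1.0 - window) / 2.0, (1.0 + window) / 2.0
    F = np.linspace(lo, hi, n)   # evenly spaced probabilities (assumed grid)
    std_normal = NormalDist()
    # Half-normal quantile: P(|z| <= x) = F  =>  x = Phi^{-1}((1 + F) / 2)
    x = np.array([std_normal.inv_cdf((1.0 + float(f)) / 2.0) for f in F])
    u = np.log(x)                # ln x
    y = np.log(-np.log1p(-F))    # ln(-ln(1 - F))
    k, _ = np.polyfit(u, y, 1)
    return float(k)

for window in (0.70, 0.80, 0.90):
    print(f"window = {window:.0%}: k = {halfnormal_k(window):.4f}")
```

A half-normal is not exactly Weibull, so its probability plot is curved and the fitted slope can shift with the window. Comparing that shift against the width of the reported [1.186, 1.204] band is one way to gauge how much of the band could be attributable to the protocol.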
Circularity Check
No significant circularity detected; framework is observational and self-contained.
Full rationale
The paper introduces a Weibull fitting protocol as a diagnostic tool and reports empirical observations of k values across model components and families. The Transmission and Selection classes are predefined by architectural roles (FFN/W_o vs. W_q/W_k), with k ranges presented as measured outcomes rather than derived predictions. The initialization anchor (k ~ 1.20 from HalfNormal) uses the same middle-80% protocol as a baseline comparison and does not create a self-referential loop. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force results. The central claims consist of cross-model statistics that remain independent of the fitting inputs by construction.
Axiom & Free-Parameter Ledger
Free parameters (2)
- Weibull shape parameter k
- Weibull scale parameter lambda
Axioms (1)
- Domain assumption: Element-wise absolute weight magnitudes in transformers are adequately described by a Weibull distribution for diagnostic purposes
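For reference, the standard two-parameter Weibull form that this assumption refers to, and the linearization a probability plot relies on, can be written as follows. Here x stands for a weight magnitude |w|; this is the textbook parameterization, which is assumed rather than confirmed to match the paper's notation.

```latex
% Two-parameter Weibull CDF for x = |w| >= 0, shape k > 0, scale \lambda > 0
F(x; k, \lambda) = 1 - \exp\!\left[-\left(\frac{x}{\lambda}\right)^{k}\right]

% Taking logarithms twice gives a straight line in \ln x
\ln\!\bigl(-\ln(1 - F(x))\bigr) = k \ln x - k \ln \lambda
```

On a plot of ln(-ln(1 - F)) against ln x, the slope estimates k and the intercept -k ln(lambda) gives the scale, so departures from a straight line indicate departure from the Weibull family.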
Reference graph
Works this paper leans on
- [18] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4895-4901, Singapore, 2023. Association for Computati...
- [19] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #...
- [20] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023.
- [21] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits ...
- [22] Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, and Quanquan Gu. Robust layerwise scaling rules by proper weight decay tuning. arXiv preprint arXiv:2510.15262, 2025.
- [23] Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [24] Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models. In International Conference on Learning Representations (ICLR), 2025.
- [25] Frederik Kunstner, Jacques Chen, J. Wilder Lavington, and Mark Schmidt. Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be. In International Conference on Learning Representations (ICLR), 2023.
- [26] Charles H. Martin and Michael W. Mahoney. Traditional and heavy-tailed self regularization in neural network models. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 4284-4293. PMLR, 2019.
- [27] Charles H. Martin and Michael W. Mahoney. Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), pages 505-513. SIAM, 2020.
- [28] Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. In International Conference on Spoken Language Translation (IWSLT), 2019.
- [29] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a... In-context Learning and Induction Heads.
- [30] Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024. COLM 2024.
- [31] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5797-5808, Florence, Italy, 2019. Association for Computational Linguistics.
- [32] Xi Wang and Laurence Aitchison. How to set AdamW's weight decay as you scale model and dataset size. arXiv preprint arXiv:2405.13698, 2024. Preprint; v3 released 1 Jun 2025.
- [33] Waloddi Weibull. A statistical distribution function of wide applicability. Journal of Applied Mechanics, 18(3):293-297, 1951. doi:10.1115/1.4010337
- [34] Mengxia Yu, De Wang, Qi Shan, Colorado J. Reed, and Alvin Wan. The super weight in large language models. arXiv preprint arXiv:2411.07191, 2024.