
arxiv: 2605.18898 · v1 · pith:UIJ5IIEInew · submitted 2026-05-17 · 💻 cs.LG

A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

Pith reviewed 2026-05-20 14:15 UTC · model grok-4.3

classification 💻 cs.LG
keywords Weibull distribution · transformer weights · weight magnitude distribution · functional classification · attention projections · feed-forward networks · training dynamics

The pith

Transformer FFN and attention output projection weights converge to a narrow band of the Weibull shape parameter k around 1.19 across diverse architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using the two-parameter Weibull distribution to analyze the magnitude distributions of individual weight matrices in transformer models. Fitting uses a middle-80% probability-plot protocol under which i.i.d. Gaussian weights at initialization give k ≈ 1.20, and that value serves as the anchor. Across twelve model entries from seven families, the authors report that FFN modules and the attention output projection W_o reach median terminal k values in the narrow interval [1.186, 1.204]. This Transmission Class shares the band irrespective of activation function, normalization style, and model size from 70 million to 14 billion parameters. The attention input projections W_q and W_k, by contrast, show k values that depend on how Q and K are stored. The scale parameter lambda grows during training and, within the Pythia family, scales with the square root of the learning rate divided by the weight decay.

Core claim

The Weibull shape parameter k labels the functional class of a weight matrix, with the Transmission Class (FFN modules and W_o) stabilizing in a narrow band of median terminal k in [1.186, 1.204] (cross-family CV = 0.51%) that is shared across SwiGLU/GeLU, Pre-LN/QK-Norm, and 70M-14B sizes, while the Selection Class (W_q, W_k) departs from the Weibull family with severity modulated by storage format; the scale lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family.
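The scaling statement in the core claim can be checked with a through-origin least-squares fit and a Pearson correlation, which are the two statistics the paper quotes for its Pythia comparison. The sketch below shows the arithmetic only. The learning rates, weight decay, jitter factors, and the 0.5 proportionality constant are invented placeholders, not values from the paper or from any Pythia recipe, and `slope_through_origin` and `pearson_r` are hypothetical helper names.

```python
import math


def slope_through_origin(xs, ys):
    """Least-squares slope of y = a * x with no intercept term."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Placeholder hyperparameters, invented for illustration only.
peak_lrs = [1.0e-3, 6.0e-4, 3.0e-4, 2.0e-4, 1.2e-4]
weight_decay = 0.1
# Deterministic multiplicative deviations standing in for per-model scatter.
jitter = [1.10, 0.90, 1.05, 0.95, 1.00]

# x-axis of the comparison: sqrt(eta / lambda_wd) per model.
xs = [math.sqrt(lr / weight_decay) for lr in peak_lrs]
# Synthetic "fitted lambda" values built from an assumed proportional law.
lambdas = [0.5 * x * j for x, j in zip(xs, jitter)]

slope = slope_through_origin(xs, lambdas)
r = pearson_r(xs, lambdas)
print(f"slope = {slope:.3f}, Pearson r = {r:.3f}")
```

With real data, `lambdas` would hold the fitted terminal scale per model and `xs` the corresponding sqrt(eta/lambda_wd) from each training recipe. A high r with large per-model deviations from the fitted line would indicate a directional rather than quantitative match.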

What carries the argument

The two-parameter Weibull distribution applied to absolute weight magnitudes, with shape k serving as a functional-class label and scale lambda as a training-progress indicator, fitted independently per matrix using middle-80% probability plots.
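A minimal sketch of what such a fit might look like, under assumptions the paper summary does not confirm: plotting positions p_i = (i - 0.5)/n and an unweighted least-squares line of Y = ln(-ln(1 - p)) against X = ln|w|, restricted to 0.10 ≤ p ≤ 0.90. Since the Weibull CDF gives Y = k·X - k·ln(lambda), the slope estimates k and the intercept yields lambda. `fit_weibull_middle80` is a hypothetical helper, not the API of npm-weibull-py, and the input is synthetic Gaussian noise standing in for an initialized weight matrix.

```python
import math
import random


def fit_weibull_middle80(weights, lo=0.10, hi=0.90):
    """Return (k, lam) from a least-squares line on the Weibull probability plot.

    Assumptions (not taken from the paper): plotting positions (i - 0.5) / n
    and an unweighted regression of Y = ln(-ln(1 - p)) on X = ln|w|.
    """
    mags = sorted(abs(w) for w in weights if w != 0.0)
    n = len(mags)
    xs, ys = [], []
    for i, m in enumerate(mags, start=1):
        p = (i - 0.5) / n
        if lo <= p <= hi:  # keep only the central quantile window
            xs.append(math.log(m))
            ys.append(math.log(-math.log1p(-p)))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    k = sxy / sxx                       # slope of the line is the shape k
    lam = math.exp(-(my - k * mx) / k)  # intercept equals -k * ln(lam)
    return k, lam


# Synthetic stand-in for an initialized weight matrix: i.i.d. Gaussian entries.
rng = random.Random(0)
sigma = 0.02
init_weights = [rng.gauss(0.0, sigma) for _ in range(50_000)]
k0, lam0 = fit_weibull_middle80(init_weights)
print(f"k = {k0:.3f}, lambda / sigma = {lam0 / sigma:.3f}")
```

On half-normal magnitudes this kind of fit should land near the k ≈ 1.20 anchor quoted in the paper. The exact value depends on the plotting-position and regression choices assumed here.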

If this is right

  • Weight matrices can be classified into Transmission or Selection roles based solely on their fitted k value after training.
  • The framework enables per-layer and per-step monitoring of training dynamics through changes in lambda.
  • Architectural decisions such as grouped-query attention versus separate Q/K storage directly influence how far selection weights depart from the Weibull family.
  • k remains stable for transmission components even as model scale increases from 70M to 14B parameters.
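As an illustration of the first bullet, the reported k ranges can be turned into a simple lookup. This is a sketch only: the paper reports these intervals as observations across its 12-entry cohort, not as a decision rule, and `label_terminal_k` is a hypothetical helper. Values falling between the reported intervals are deliberately left unlabeled.

```python
# Ranges quoted in the abstract; treating them as thresholds is an assumption.
TRANSMISSION_BAND = (1.186, 1.204)    # FFN and W_o, median terminal k
GQA_SELECTION_RANGE = (1.10, 1.16)    # W_q / W_k in grouped-query attention models
SEPARATE_QK_RANGE = (0.76, 0.99)      # separately stored Q/K (OLMo-1, OLMo-2)


def label_terminal_k(k):
    """Map a fitted terminal k to the reported range it falls in, if any."""
    if TRANSMISSION_BAND[0] <= k <= TRANSMISSION_BAND[1]:
        return "inside Transmission band"
    if GQA_SELECTION_RANGE[0] <= k <= GQA_SELECTION_RANGE[1]:
        return "inside GQA Selection range"
    if SEPARATE_QK_RANGE[0] <= k <= SEPARATE_QK_RANGE[1]:
        return "inside separately-stored Q/K Selection range"
    return "outside the reported ranges"


for k in (1.19, 1.12, 0.85, 1.30):
    print(k, "->", label_terminal_k(k))
```

The paper reports that Pythia's merged W_qkv occupies a transitional zone, so a lookup like this is at best a first-pass flag and not a substitute for inspecting the per-layer fits.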

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The narrow convergence of k in transmission weights may reflect an underlying requirement for stable magnitude distributions to support reliable information flow through the network.
  • This diagnostic could be applied during training to detect when a component begins to deviate from its expected functional class.
  • Similar Weibull analysis might reveal analogous class distinctions in non-transformer architectures such as convolutional networks.

Load-bearing premise

The middle-80% probability-plot fitting protocol yields a k value that is stable and reflects the true functional class of the weight matrix rather than depending on the specific fitting window or preprocessing steps chosen.
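One way to probe this premise is to refit the same magnitudes with several central windows and compare the resulting k values. The sketch below does this on synthetic half-normal data, so it says nothing about trained weights. The plotting positions (i - 0.5)/n, the unweighted regression, and the helper name `fit_k` are assumptions rather than the paper's protocol.

```python
import math
import random


def fit_k(magnitudes_sorted, central):
    """Shape k from an OLS line on the Weibull plot over the central fraction."""
    lo = (1.0 - central) / 2.0
    hi = 1.0 - lo
    n = len(magnitudes_sorted)
    pts = []
    for i, m in enumerate(magnitudes_sorted, start=1):
        p = (i - 0.5) / n
        if lo <= p <= hi:
            pts.append((math.log(m), math.log(-math.log1p(-p))))
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx


# Synthetic half-normal magnitudes (|N(0, 1)|), not real model weights.
rng = random.Random(1)
mags = sorted(abs(rng.gauss(0.0, 1.0)) for _ in range(50_000))
mags = [m for m in mags if m > 0.0]

# Refit with 70%, 80%, 90%, and full-data windows.
k_by_window = {c: fit_k(mags, c) for c in (0.70, 0.80, 0.90, 1.00)}
for c, k in k_by_window.items():
    print(f"central {c:.0%}: k = {k:.3f}")
```

If k moves by more than the width of the reported band as the window changes, the band is at least partly a property of the window rather than of the weights alone.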

What would settle it

If a newly trained transformer from an unseen family shows the FFN and W_o matrices terminating with k values outside the interval [1.186, 1.204], this would indicate that the narrow band is not universal across all architectures.

Figures

Figures reproduced from arXiv: 2605.18898 by Tiexin Ding.

Figure 1
Diagnostic framework architecture. The framework comprises four interconnected components organized as a layered architecture. Theory (top-left): the Weibull two-parameter primitive f(x; k, λ) = (k/λ)(x/λ)^(k−1) exp(−(x/λ)^k) with its σ-inverse form λ = σ / sqrt(Γ(1 + 2/k) − Γ²(1 + 1/k)), anchored by two reference scales: the k anchor (Half-Normal initialization, k_0 ≈ 1.20 via middle-80% probability-plot fit -- n…
Figure 2
Middle-80% trim is noise-optimal. Theoretical curve of Var(Y)/Var(Y)_min as a function of quantile p, where Y = ln(−ln(1 − F)) is the Weibull probability transform. Three shaded regions: bottom 10% (red, p < 0.10), middle 80% (green, 0.10 ≤ p ≤ 0.90, the low-noise region used for the k_80% fit), top 10% (orange, p > 0.90). The variance diverges at both endpoints and reaches its minimum at p* ≈ 0.797; the b…
Figure 3
Pythia-70m |W| Weibull body fit (left column) and tail evolution (right column) across 4 representative training steps (1, 1000, 5000, 143000), Transmission Class only (W_o + FFN, W_qkv excluded). The left column shows the middle-80% Weibull fit on the Transmission Class bulk weight distribution (n = 14,155,776 samples per checkpoint). The shape parameter k stays near initialization throughout training (aggre…
Figure 4
k drift dichotomy across the 12-entry cohort. Panel A: k drift magnitude from initialization to terminal checkpoint (Transmission median = −0.014, Selection median = −0.071). Panel B: terminal k positions relative to the Transmission band [1.186, 1.204] (yellow band). Both panels share the same x-axis: Selection components (W_q, W_k, W_qkv) and Transmission components (W_v, W_o, FFN_in, FFN_out, FFN_down); the …
Figure 5
Body–tail ablation across the 12-entry cohort. Three trim protocols (k_80%, k_90%, k_100%) applied to FFN + W_o components across 12 entries spanning 7 architectural families. The middle-80% protocol places 10/12 entries inside the Transmission band [1.186, 1.204]; the full-data fit (k_100%) places 0/12 inside, with the body–tail gap k_80% − k_100% = 0.0519 ± 0.0017 (CV = 3.3%) representing the heavy-tail influence o…
Figure 6
OLMo-1 (7B, terminal) per-layer |w| distribution heatmap: 7 components × 32 blocks. q_proj and k_proj (Selection Class) show a pronounced specialization tail extending to log10 |w| ≈ −5 in mid-to-deep blocks (blocks 8–17 and 23–27). v_proj and o_proj (Transmission Class) show tight distributions across blocks. FFN gate/up/down (bottom row) show tight, near-identical distributions across all 32 blocks. The Q…
Figure 7
QK-Norm contrast: cross-family terminal Q/K per-layer heatmap. Three terminal-checkpoint models: OLMo-1 7B (no QK-Norm, red border), Qwen2.5 14B (no QK-Norm, red border), Qwen3 8B (QK-Norm, blue border). Both models without QK-Norm show a pronounced specialization tail extending to log10 |w| ≈ −5 in shallow-to-mid blocks. Qwen3 8B (with QK-Norm) shows visibly tighter Q/K distributions; the specialization tail is …
Figure 8
Pythia merged W_qkv per-layer |w| distribution: 4 sizes × 4 training steps. Heatmap density of log10 |w| across blocks, with columns = training step and rows = Pythia size. The bulk distribution stays narrow; the left tail extends progressively, signaling Selection-class specialization within the merged W_qkv tensor. The dimensionless T/τ = T · η · λ_wd partitions the 5 Pythia sizes into Physical States: Py…
Figure 9
MHA vs GQA dichotomy: terminal k_q and k_k across the 12-entry cohort. Three architectural groups: separately-stored MHA (OLMo-1, OLMo-2; deep Selection, k ∈ [0.76, 0.99]), grouped-query attention (LLaMA-3, Mistral, Qwen2.5-7B/14B, Qwen3; mild Selection, k ∈ [1.10, 1.16]), and Pythia merged W_qkv (transitional, k_qkv ∈ [1.05, 1.18], T/τ-monotonic across 70M–6.9B). The architecture-dependent severity is con…
Figure 10
4-component λ trajectory across 5 Pythia sizes. Four subplots show the median λ per block across training steps for the four Transmission Class kinds (W_qkv, W_o, W_FFN_in, W_FFN_out), with each subplot overlaying the 5 Pythia sizes (70m–6.9B) color-coded by T/τ Physical State. Subplot ordering follows the transformer forward pass: W_qkv → W_o → W_FFN_in → W_FFN_out. The paired growth across W_o and W_FFN_out (Pea…
Figure 11
Single-family mean λ trajectory across 5 Pythia sizes. Trajectory of the mean λ (across 4 component kinds) plotted against training step in log scale, showing the cosine LR schedule effect: λ rises through warmup, peaks near step 10k, then decays as η cosine-decreases. The 6.9B trajectory ends at lower T/τ (Transition regime), reflecting the inversely-scaled η_peak in the Pythia recipe.
Figure 12
Pythia-410m terminal per-block (k, λ) profile (24 blocks, aggregate fit per block). λ (left) shows a depth-dependent rise in deep blocks (16–23, max ∼ 1.22× shallow); k (right) stays within the Transmission band [1.186, 1.204] for most blocks, with a slight drop at the deepest 2 blocks (super-weight tail effect). Per-block depth-heterogeneity complements component-paired uniformity (Pearson r = 0.9967 between…
Figure 13
Within-Pythia λ scaling vs. sqrt(η/λ_wd). Terminal mean λ across the three Transmission Class kinds (W_o, W_FFN_in, W_FFN_out; W_qkv excluded as Selection per Section 2.2) plotted against sqrt(η_peak/λ_wd) for the 5 Pythia sizes. Linear fit through origin gives slope 0.087, Pearson r = 0.94. Per-size deviations of 7–36% indicate directional rather than quantitative match with the Fan et al. (2025) scaling law.
Figure 14
Pythia-70m k + λ trajectory across full training: 14 log-spaced checkpoints (step 1 → step 143,000), Transmission Class only. The left panel shows k(step) on a log x-axis, with the paper Section 3 Transmission band [1.186, 1.204] overlaid (yellow) and the half-Normal anchor k_0 ≈ 1.20 as a dashed reference. The right panel shows λ(step). Both k and λ are reported as aggregate fits on pooled Transmission C…
Abstract

We apply the Weibull distribution -- a two-parameter family from extreme-value theory -- as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o -- the Transmission Class -- fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B sizes. Second, the attention input projections W_q, W_k -- the Selection Class -- depart from the Weibull family, with severity shaped by storage: separately-stored Q/K (OLMo-1, OLMo-2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia's merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm-weibull-py v0.4 (Python library) and DATABASE_v9_1 at https://github.com/tiexinding/NPM-Weibull-public .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-parameter Weibull framework for diagnosing element-wise weight magnitude distributions in transformers. It anchors k ≈ 1.20 at initialization via middle-80% probability-plot fit to the HalfNormal that arises from i.i.d. Gaussian weights, then uses per-matrix fits across training to classify components into Transmission (FFN and W_o, narrow terminal k band [1.186, 1.204], CV 0.51% across 12 models) and Selection (W_q/W_k, k varying by storage and architecture) classes, while showing lambda scaling with sqrt(eta/lambda_wd). Code and database are released.

Significance. If the reported k stability holds under scrutiny, the framework supplies an architecture-independent, per-component diagnostic that separates functional roles from training progress and could complement aggregate statistics for monitoring convergence. The public library and dataset are concrete strengths for reproducibility.

major comments (2)
  1. Abstract: the central claim of a narrow, architecture-independent k band for the Transmission Class rests on the middle-80% probability-plot fitting protocol, yet no sensitivity analysis to window width (70% vs. 90%), tail truncation, or preprocessing is reported; this leaves open whether the [1.186, 1.204] interval and 0.51% CV are partly produced by the chosen window rather than genuine convergence.
  2. Abstract: functional classes are defined by the observed ranges of the fitted k values themselves, so the reported 'findings' function as descriptive summaries of the fits rather than independent predictions; this circularity reduces the framework's ability to serve as a measuring stick for unseen architectures or training regimes.
minor comments (2)
  1. Abstract: the 12 model entries, exact checkpoints, data exclusion rules, and raw fit statistics (R², Kolmogorov-Smirnov, etc.) are not enumerated, making it impossible to assess whether post-hoc choices affect the reported bands.
  2. Abstract: error bars or per-model variability measures are absent from the median k and CV statements; adding them would strengthen the cross-family consistency claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below. We agree that sensitivity analysis is needed to confirm robustness of the k band and will add it in revision. We clarify the a priori architectural motivation for the classes to address the circularity concern.

Point-by-point responses
  1. Referee: Abstract: the central claim of a narrow, architecture-independent k band for the Transmission Class rests on the middle-80% probability-plot fitting protocol, yet no sensitivity analysis to window width (70% vs. 90%), tail truncation, or preprocessing is reported; this leaves open whether the [1.186, 1.204] interval and 0.51% CV are partly produced by the chosen window rather than genuine convergence.

    Authors: We agree that the manuscript lacks a sensitivity analysis for the fitting protocol. In the revised version we will report results from re-fitting the Transmission-class matrices using windows of 70%, 80%, and 90%, together with checks on tail truncation and standard preprocessing variants. These additional experiments will quantify any variation in the reported terminal-k interval and CV, allowing readers to assess whether the observed stability is robust to reasonable changes in the protocol. revision: yes

  2. Referee: Abstract: functional classes are defined by the observed ranges of the fitted k values themselves, so the reported 'findings' function as descriptive summaries of the fits rather than independent predictions; this circularity reduces the framework's ability to serve as a measuring stick for unseen architectures or training regimes.

    Authors: The Transmission and Selection classes are motivated by the distinct functional roles of the matrices inside the transformer: Transmission matrices (FFN and W_o) propagate transformed activations, while Selection matrices (W_q and W_k) compute attention scores. The Weibull shape parameter k is then used as an empirical diagnostic that consistently separates these roles across twelve models from seven families. We will revise the abstract and introduction to state this architectural motivation first, thereby framing the narrow k band as confirmatory evidence rather than the sole definition of the classes, and clarifying the framework's intended use for new architectures. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected; framework is observational and self-contained.

full rationale

The paper introduces a Weibull fitting protocol as a diagnostic tool and reports empirical observations of k values across model components and families. The Transmission and Selection classes are predefined by architectural roles (FFN/W_o vs. W_q/W_k), with k ranges presented as measured outcomes rather than derived predictions. The initialization anchor (k ~ 1.20 from HalfNormal) uses the same middle-80% protocol as a baseline comparison and does not create a self-referential loop. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force results. The central claims consist of cross-model statistics that remain independent of the fitting inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on empirical fitting of Weibull parameters to observed weight magnitudes rather than on first-principles derivations; the central claims are therefore descriptive of the fitted statistics across the tested models.

free parameters (2)
  • Weibull shape parameter k
    Estimated independently for each weight matrix via middle-80% probability-plot fit at every checkpoint.
  • Weibull scale parameter lambda
    Fitted to data and observed to grow during training and correlate with sqrt(eta/lambda_wd).
axioms (1)
  • domain assumption: Element-wise absolute weight magnitudes in transformers are adequately described by a Weibull distribution for diagnostic purposes
    Invoked when applying the two-parameter family to |w| distributions at initialization and throughout training.

pith-pipeline@v0.9.0 · 5935 in / 1456 out tokens · 73912 ms · 2026-05-20T14:15:15.463731+00:00 · methodology


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 5 internal anchors

  1. [1]

    A statistical distribution function of wide applicability

    Waloddi Weibull. A statistical distribution function of wide applicability. Journal of Applied Mechanics, 18(3): 293--297, 1951

  2. [2]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. arXiv:2305.13245

  3. [3]

    Traditional and Heavy-Tailed Self Regularization in Neural Network Models

    Charles H. Martin and Michael W. Mahoney. Traditional and heavy-tailed self regularization in neural network models. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019. arXiv:1901.08276

  4. [4]

    Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks

    Charles H. Martin and Michael W. Mahoney. Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), 2020. arXiv:1901.08278

  5. [5]

    AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs

    Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  6. [7]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, ... A mathematical framework for transformer circuits. Transformer Circuits Thread

  7. [8]

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019. arXiv:1905.09418

  8. [10]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  9. [11]

    From attention to activation: Unravelling the enigmas of large language models

    Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models. In International Conference on Learning Representations (ICLR), 2025

  10. [13]

    Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be

    Frederik Kunstner, Jacques Chen, J. Wilder Lavington, and Mark Schmidt. Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be. In International Conference on Learning Representations (ICLR), 2023

  11. [15]

    Transformers without tears: Improving the normalization of self-attention

    Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. In International Conference on Spoken Language Translation (IWSLT), 2019

  12. [17]

    GPT-NeoX-20B: An open-source autoregressive language model

    Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of ..., 2022

  13. [18]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4895--4901, Singapore, 2023. Association for Computati...

  14. [19]

    GPT-NeoX-20B: An open-source autoregressive language model

    Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #...

  15. [20]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

  16. [21]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

  17. [22]

    Robust layerwise scaling rules by proper weight decay tuning

    Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, and Quanquan Gu. Robust layerwise scaling rules by proper weight decay tuning. arXiv preprint arXiv:2510.15262, 2025

  18. [23]

    AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs

    Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  19. [24]

    From attention to activation: Unravelling the enigmas of large language models

    Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models. In International Conference on Learning Representations (ICLR), 2025

  20. [25]

    Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be

    Frederik Kunstner, Jacques Chen, J. Wilder Lavington, and Mark Schmidt. Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be. In International Conference on Learning Representations (ICLR), 2023

  21. [26]

    Traditional and heavy-tailed self regularization in neural network models

    Charles H. Martin and Michael W. Mahoney. Traditional and heavy-tailed self regularization in neural network models. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 4284--4293. PMLR, 2019

  22. [27]

    Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks

    Charles H. Martin and Michael W. Mahoney. Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), pages 505--513. SIAM, 2020

  23. [28]

    Transformers without tears: Improving the normalization of self-attention

    Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. In International Conference on Spoken Language Translation (IWSLT), 2019

  24. [29]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...

  25. [30]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024. COLM 2024

  26. [31]

    Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned

    Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5797--5808, Florence, Italy, 2019. Association for Computational Linguistics

  27. [32]

    How to set AdamW's weight decay as you scale model and dataset size

    Xi Wang and Laurence Aitchison. How to set AdamW's weight decay as you scale model and dataset size. arXiv preprint arXiv:2405.13698, 2024. Preprint; v3 released 1 Jun 2025

  28. [33]

    A statistical distribution function of wide applicability

    Waloddi Weibull. A statistical distribution function of wide applicability. Journal of Applied Mechanics, 18(3): 293--297, 1951. doi:10.1115/1.4010337

  29. [34]

    The super weight in large language models

    Mengxia Yu, De Wang, Qi Shan, Colorado J. Reed, and Alvin Wan. The super weight in large language models. arXiv preprint arXiv:2411.07191, 2024