pith. sign in

arxiv: 2605.22297 · v1 · pith:HHWO7G6Nnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

Pith reviewed 2026-05-22 07:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords layerwise learning rateheavy-tailed self-regularizationLLM trainingTransformer optimizationempirical spectral densityadaptive learning rateszero-shot accuracy
0
0 comments X

The pith

Different learning rates for different Transformer layers, set by how heavy-tailed their weights are, speed up training and raise accuracy in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Layerwise Learning Rate (LLR), which gives each Transformer layer its own learning rate based on the heavy-tailedness of its weight correlation matrix. Layers with less heavy-tailed spectra get larger rates to speed them up, while more heavy-tailed ones get smaller rates to avoid instability. This is meant to balance the training across the heterogeneous structure of the model. Experiments on models from 60M to 1B parameters show faster convergence and higher zero-shot performance compared to using one rate for all layers. The method reuses settings from a standard uniform run with little extra tuning.

Core claim

By quantifying the heavy-tailedness of each layer's empirical spectral density using Heavy-Tailed Self-Regularization theory, one can assign larger learning rates to layers with weaker heavy-tailedness and smaller ones to layers with stronger heavy-tailedness. This layer-specific adjustment promotes more balanced training dynamics, resulting in up to 1.5 times faster training and an increase in average zero-shot accuracy from 47.09% to 49.02% across various LLM architectures and optimizers.

What carries the argument

The Heavy-Tailed Self-Regularization (HT-SR) measure of empirical spectral density (ESD) heavy-tailedness for each layer's weight correlation matrix, used to derive per-layer learning rate multipliers that accelerate weaker layers and stabilize stronger ones.

If this is right

  • Training time for LLMs can be reduced by up to 1.5x while maintaining or improving performance.
  • Zero-shot accuracy improves by about 2 percentage points on average without changing the optimizer or architecture.
  • Optimal per-layer rates can be transferred from a single uniform learning rate baseline with minimal additional search.
  • Balanced convergence across layers reduces the risk of some layers lagging behind others in large models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended to other architectures beyond Transformers if their layer weight matrices show similar spectral heterogeneity.
  • Combining LLR with other adaptive methods like cosine scheduling might yield further gains, though this is not tested in the paper.
  • If the HT-SR measure correlates with layer importance or depth, it might explain why certain layers benefit from higher rates.

Load-bearing premise

The assumption that the heavy-tailedness measure from the spectral density directly indicates how much to adjust the learning rate for each layer, and that this adjustment remains effective without needing recalibration for different runs or scales.

What would settle it

Running the same model training twice, once with uniform learning rate and once with LLR assigned by HT-SR, and finding no statistically significant difference in final accuracy or training speed would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.22297 by Di He, Keyu Wang, Lu Yin, Shiwei Liu, Songjun Tu.

Figure 1
Figure 1. Figure 1: Learning rate sensitivity of different layerwise LR methods on LLaMa-60M, LLaMa-135M and Nano-GPT pre-training. Across all methods, data points failing to converge, resulting in excessively high perplexity values (exceeding the axis limits), are denoted by a double dash (“=”). 1.0 1.5 2.0 2.5 3.0 3.5 PL_Alpha_Hill Mean Embed FFN Att.v/o Att.q/k Uniform LLR 0 10k 20k 30k 40k Number of steps 2.3 2.9 3.5 Trai… view at source ↗
Figure 2
Figure 2. Figure 2: Left: Mean PL_Alpha_Hill value comparison (more balanced is preferred) between Uniform LR and LLR, with LLaMa-1B on FineWeb; Middle: Training loss curves of LLaMa-1B and LLaMa-3B under the AdamW optimizer; Right: Layerwise learning rate schedules when combining LLR with cosine decay scheduler. Performance comparison: LLR (Base LR=5e-4) obtains Perplexity (↓) =9.59 and Q-A Acc. (↑) =49.02%; Uniform (LR=5e-4… view at source ↗
Figure 3
Figure 3. Figure 3: The bars denote the Learning Rate, and the line indicates the PL_Alpha_Hill. Given the imbalanced layerwise PL_Alpha_Hill of LLaMa-60M, LLR assigns lower LR to layers with lower PL_Alpha_Hill. 0.0001 0.0005 0.001 0.005 Learning Rate 15 25 35 Perplexity Embedding Gains of LLR Uniform LLR-unembed LLR 0 1k 2k 3k 4k 5k Step 0 0.001 0.002 Learning Rate LR Switching Schemes Soft Switch Hard Switch Uniform 10% 20… view at source ↗
Figure 4
Figure 4. Figure 4: Left: Embedding gains of LLR compared to baselines across different LRs with LLaMa-135M. “LLR-unembed” denotes the ablation variant without the tailored spectral-based adjustment for Embedding layer; Middle: Illustration of hard (perplexity = 17.18) versus soft (perplexity = 17.03) LR switching schemes with LLaMa-135M; Right: Impact of LLR phase duration on perplexity and extra time cost with LLaMa-135M. T… view at source ↗
Figure 5
Figure 5. Figure 5: Evolution of PL_Alpha_Hill for different parameter groups when training LLaMA-135M with LLR (perplexity = 17.03) and Uniform (perplexity = 17.86). The curves are computed every 100 training steps. Algorithm 1 LLR Input: Global LR η, number of training steps tmax, interval t˜ of using LLR, maximum scaling ratio s, switching steps tswitch, and α i T refers to ith layer’s PL_Alpha_Hill at update step T ∈ {0,t… view at source ↗
Figure 6
Figure 6. Figure 6: Training dynamics of LLaMA-350M with the AdamW optimizer over 20B tokens: training loss and zero-shot accuracy evaluated every 1,000 steps, including the average performance across eight downstream tasks [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Training loss for the LLaMa-135M and LLaMa-350M model, comparing Uniform (3B/7B and 4.5B/10.5B tokens) against LLR (3B/7B tokens). The experiments employ AdamW under two different learning rates. Speedup [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Training dynamics of LLaMA-350M with the Muon optimizer over 20B tokens. 4.3. More Analysis In this section, we evaluate the downstream gains of all methods on finetuning tasks and the proposed LLR across more model architecture and optimizer to demonstrate its generality and robustness. Finetuning Results. We evaluate all methods on finetuning tasks using roberta-base from the Commonsense Rea￾soning datas… view at source ↗
Figure 9
Figure 9. Figure 9: (Varying HT-SR metrics). Comparing PL_Alpha_Hill with multiple metrics under different learning rate settings. The value on the top of each bar indicates the difference from the leftmost bar in each plot and the same processing is applied in [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: (Varying PL fitting methods). Analysis of three PL fitting methods—Goodness-of-fit, Fix-finger, and Median—across LLaMa-60M and LLaMa-135M. Varying PL fitting gaps. To evaluate the effect of PL fitting frequency on model performance, we compare different up￾date gaps under identical experimental conditions in [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: (Varying PL fitting gaps). PL fitting is conducted at varying gaps across training steps to evaluate the trade-off between performance. Bars represent validation perplexity (lower is better). Uniform [1,2] [1,3] [1,4] [1,5] 19.20 20.00 20.80 21.60 Perplexity -0.71 -0.56 -1.58 -1.64 (a) LLaMa-60M 16.20 16.80 17.40 18.00 Perplexity -0.46 -0.80 -0.81 -0.83 (b) LLaMa-135M [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗
Figure 12
Figure 12. Figure 12: (Hyperparameter study on (1, s)). Results of a hyperparameter search for (1, s) across LLaMa-60M and LLaMa￾135M on the FineWeb dataset. The bar plots display validation perplexity (lower is better). as 500 training steps, the method maintains competitive perplexity, demonstrating robustness. These results justify using a larger fitting gap in practice to balance performance. Hyperparameter study on (1, s)… view at source ↗
Figure 13
Figure 13. Figure 13: Comparison of ESD distributions across layers of LLaMa-135M under different training methods (LLR: Perplex￾ity=17.03 vs. Uniform: Perplexity=17.86). Attention-related layers (e.g., att.q, att.k) exhibit notably heavier spectral tails in contrast to FFN-associated layers. Our method systematically bal￾ances the heavy-tailed properties across layers by appropriately configuring layer-wise LR, thereby enhanc… view at source ↗
Figure 14
Figure 14. Figure 14: Evolution of PL_Alpha_Hill distributions from weights of LLaMA-135M components across iterations during pre-training on the Fineweb dataset, trained by AdamW with uni￾form LR. ity in the PL_Alpha_Hill values among different layers. Specifically, attention-related components (e.g., Att.q/k, Att.v/o) consistently exhibit lower PL_Alpha_Hill values compared to FFN layers and Embeddings through￾out the traini… view at source ↗
read the original abstract

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M-1B) demonstrate that LLR achieves up to 1.5x training speedup and outperforms baselines, notably raising average zero-shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Layerwise Learning Rate (LLR), an adaptive per-layer learning rate scheme for Transformer-based LLMs. It leverages Heavy-Tailed Self-Regularization (HT-SR) theory to quantify the heavy-tailedness of the empirical spectral density (ESD) of each layer's weight correlation matrix. Layers exhibiting weaker heavy-tailedness receive larger learning rates to accelerate convergence, while those with stronger heavy-tailedness are assigned smaller rates. Experiments span architectures (LLaMA to GPT-nano), optimizers (AdamW, Muon), and scales (60M–1B parameters), reporting up to 1.5× training speedup and an increase in average zero-shot accuracy from 47.09% to 49.02%, with the practical benefit of transferring near-optimal settings from a single uniform-LR baseline run.

Significance. If the HT-SR-guided mapping from layer-wise heavy-tailedness to LR multipliers can be shown to be robust, near-optimal, and superior to generic heterogeneous schedules, the work would offer a principled, low-overhead approach to handling structural heterogeneity in large Transformers. The broad experimental coverage across models, optimizers, and scales, together with public code, provides a reproducible empirical basis that could influence practical LLM training pipelines.

major comments (2)
  1. [Experiments] Experiments section: No ablation isolates the specific HT-SR-to-LR mapping. Controls such as reversed assignment (stronger heavy-tailed ESD receiving larger LR), random per-layer multipliers sampled from the same range, or a learned mapping are absent. Without them the reported speedups and accuracy gains cannot be attributed to the heavy-tailedness signal rather than heterogeneity alone.
  2. [Method] Method and abstract: The precise functional form or selection procedure that converts the HT-SR heavy-tailedness metric into a per-layer LR multiplier is not specified in sufficient detail, nor is its stability across random seeds, optimizer choice, or model scale demonstrated. This mapping is load-bearing for the transferability claim.
minor comments (2)
  1. [Abstract] Abstract: The baseline against which the 47.09% to 49.02% zero-shot accuracy lift is measured should be stated explicitly, along with the precise set of tasks contributing to the average.
  2. [Results] Figures/tables: Layer-wise ESD heavy-tailedness values and the resulting LR multipliers should be tabulated or plotted for at least one representative model to allow direct inspection of the mapping.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for strengthening the empirical validation and methodological clarity of our work. We address each major comment below and commit to incorporating the suggested improvements in the revised manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: No ablation isolates the specific HT-SR-to-LR mapping. Controls such as reversed assignment (stronger heavy-tailed ESD receiving larger LR), random per-layer multipliers sampled from the same range, or a learned mapping are absent. Without them the reported speedups and accuracy gains cannot be attributed to the heavy-tailedness signal rather than heterogeneity alone.

    Authors: We agree that the current experiments do not fully isolate the contribution of the specific HT-SR-guided mapping from the general benefit of using heterogeneous learning rates. In the revision, we will add the requested ablation studies, including a reversed assignment (larger LRs for stronger heavy-tailed layers), random per-layer multipliers sampled from the same range as our method, and, if feasible within compute limits, a comparison against a learned mapping. These additions will allow us to more rigorously attribute the observed speedups and accuracy improvements to the heavy-tailedness signal. revision: yes

  2. Referee: [Method] Method and abstract: The precise functional form or selection procedure that converts the HT-SR heavy-tailedness metric into a per-layer LR multiplier is not specified in sufficient detail, nor is its stability across random seeds, optimizer choice, or model scale demonstrated. This mapping is load-bearing for the transferability claim.

    Authors: We acknowledge that the manuscript does not provide sufficient detail on the exact functional form or selection procedure used to derive per-layer LR multipliers from the HT-SR heavy-tailedness metric. In the revised version, we will expand the Method section to explicitly describe the computation of the metric from the ESD and the precise mapping rule (including any thresholds or scaling factors). We will also add results showing the stability of this mapping across multiple random seeds, both AdamW and Muon optimizers, and the full range of model scales (60M to 1B). These changes will better support the transferability claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies external HT-SR measurements independently

full rationale

The paper grounds LLR in HT-SR theory by computing ESD heavy-tailedness on a uniform-LR baseline run and then assigning per-layer multipliers (weaker heavy-tailedness → larger LR). These multipliers are applied in a separate training run whose final accuracy and speedup are measured independently. No equation or step reduces the reported gains to a fit or definition drawn from the same data used for evaluation. The mapping rule is presented as a direct consequence of the cited theory rather than an ansatz fitted to target metrics or a self-citation chain that forbids alternatives. This is the common honest case of a self-contained method whose central claim rests on external measurements.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the validity of HT-SR as a diagnostic for layer training dynamics and on the assumption that a simple monotonic mapping from heavy-tailedness to learning rate is sufficient; no new entities are postulated and the only free choice is the exact functional form of that mapping.

free parameters (1)
  • heavy-tailedness-to-LR mapping function
    The rule that converts measured ESD heavy-tailedness into a concrete per-layer multiplier is chosen once from the uniform baseline and then frozen; its exact parametrization is not derived from first principles.
axioms (1)
  • domain assumption Heavy-Tailed Self-Regularization theory correctly quantifies the training difficulty of a Transformer layer via the empirical spectral density of its weight correlation matrix.
    Invoked in the abstract as the grounding for assigning larger rates to weaker heavy-tailed layers.

pith-pipeline@v0.9.0 · 5788 in / 1410 out tokens · 24812 ms · 2026-05-22T07:40:59.655959+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  2. [2]

    Don’t be lazy: Completep enables compute-efficient deep transformers.arXiv preprint arXiv:2505.01618, 2025

    Dey, N., Zhang, B. C., Noci, L., Li, M., Bordelon, B., Bergsma, S., Pehlevan, C., Hanin, B., and Hestness, J. Don’t be lazy: Completep enables compute-efficient deep transformers.arXiv preprint arXiv:2505.01618,

  3. [3]

    and Liu, L

    Hayou, S. and Liu, L. Optimal embedding learning rate in llms: The effect of vocabulary size.arXiv preprint arXiv:2506.15025,

  4. [4]

    Alphadecay: Module-wise weight de- cay for heavy-tailed balancing in llms.arXiv preprint arXiv:2506.14562,

    He, D., Jaiswal, A., Tu, S., Shen, L., Yuan, G., Liu, S., and Yin, L. Alphadecay: Module-wise weight de- cay for heavy-tailed balancing in llms.arXiv preprint arXiv:2506.14562,

  5. [5]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

  6. [6]

    Hu, Z., Wang, L., Lan, Y ., Xu, W., Lim, E.-P., Bing, L., Xu, X., Poria, S., and Lee, R. K.-W. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large lan- guage models.arXiv preprint arXiv:2304.01933,

  7. [7]

    Muon is Scalable for LLM Training

    Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y ., Qin, Y ., Xu, W., Lu, E., Yan, J., et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982,

  8. [8]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Liu, Y ., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V . Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692,

  9. [9]

    Model balancing helps low-data training and fine-tuning

    Liu, Z., Hu, Y ., Pang, T., Zhou, Y ., Ren, P., and Yang, Y . Model balancing helps low-data training and fine-tuning. arXiv preprint arXiv:2410.12178,

  10. [10]

    Decoupled Weight Decay Regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regu- larization.arXiv preprint arXiv:1711.05101,

  11. [11]

    Martin, C. H. and Mahoney, M. W. Traditional and heavy- tailed self regularization in neural network models.arXiv preprint arXiv:1901.08276,

  12. [12]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering.arXiv preprint arXiv:1809.02789,

  13. [13]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    URL https://arxiv.org/abs/ 2406.17557. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training

  14. [14]

    SocialIQA: Commonsense Reasoning about Social Interactions

    ISSN 0001-0782. doi: 10.1145/3474381. URL https: //doi.org/10.1145/3474381. Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y . Socialiqa: Commonsense reasoning about social interac- tions.arXiv preprint arXiv:1904.09728,

  15. [15]

    The sharpness disparity principle in transformers for ac- celerating language model pre-training.arXiv preprint arXiv:2502.19002,

    Wang, J., Wang, M., Zhou, Z., Yan, J., Wu, L., et al. The sharpness disparity principle in transformers for ac- celerating language model pre-training.arXiv preprint arXiv:2502.19002,

  16. [16]

    and Hu, E

    Yang, G. and Hu, E. J. Feature learning in infinite-width neural networks.arXiv preprint arXiv:2011.14522,

  17. [17]

    Large Batch Training of Convolutional Networks

    You, Y ., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks.arXiv preprint arXiv:1708.03888,

  18. [18]

    Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

    You, Y ., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training bert in 76 minutes.arXiv preprint arXiv:1904.00962,

  19. [19]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi, Y . Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

  20. [20]

    Why transformers need adam: A hessian perspective

    Zhang, Y ., Chen, C., Ding, T., Li, Z., Sun, R., and Luo, Z. Why transformers need adam: A hessian perspective. Advances in neural information processing systems, 37: 131786–131823, 2024a. Zhang, Y ., Chen, C., Li, Z., Ding, T., Wu, C., Kingma, D. P., Ye, Y ., Luo, Z.-Q., and Sun, R. Adam-mini: Use fewer learning rates to gain more.arXiv preprint arXiv:24...

  21. [21]

    Details of Experiments This section provides detailed configurations for both pre- training and finetuning experiments

    11 One LR Doesn’t Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs Appendix A. Details of Experiments This section provides detailed configurations for both pre- training and finetuning experiments. In Table 11 and Ta- ble 12, we present the architectural parameters of LLaMa models and the learning rate and weight decay settings for different ...

  22. [22]

    To assess this impact, we evaluate Goodness-of-fit (Alstott et al., 2014; Martin et al., 2021; Clauset et al., 2009), Fix-finger (Yang et al., 2023), and Median (Zhou et al.,

    Varying PL fitting methods.In our proposed framework, the HT-SR metric PL_Alpha_Hill is derived through PL fitting, and the choice of fitting method can affect final training effectiveness. To assess this impact, we evaluate Goodness-of-fit (Alstott et al., 2014; Martin et al., 2021; Clauset et al., 2009), Fix-finger (Yang et al., 2023), and Median (Zhou et al.,