One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Pith reviewed 2026-05-22 07:40 UTC · model grok-4.3
The pith
Different learning rates for different Transformer layers, set by how heavy-tailed their weights are, speed up training and raise accuracy in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By quantifying the heavy-tailedness of each layer's empirical spectral density using Heavy-Tailed Self-Regularization theory, one can assign larger learning rates to layers with weaker heavy-tailedness and smaller ones to layers with stronger heavy-tailedness. This layer-specific adjustment promotes more balanced training dynamics, resulting in up to 1.5 times faster training and an increase in average zero-shot accuracy from 47.09% to 49.02% across various LLM architectures and optimizers.
What carries the argument
The Heavy-Tailed Self-Regularization (HT-SR) measure of empirical spectral density (ESD) heavy-tailedness for each layer's weight correlation matrix, used to derive per-layer learning rate multipliers that accelerate weaker layers and stabilize stronger ones.
If this is right
- Training time for LLMs can be reduced by up to 1.5x while maintaining or improving performance.
- Zero-shot accuracy improves by about 2 percentage points on average without changing the optimizer or architecture.
- Optimal per-layer rates can be transferred from a single uniform learning rate baseline with minimal additional search.
- Balanced convergence across layers reduces the risk of some layers lagging behind others in large models.
Where Pith is reading between the lines
- The approach could be extended to other architectures beyond Transformers if their layer weight matrices show similar spectral heterogeneity.
- Combining LLR with other adaptive methods like cosine scheduling might yield further gains, though this is not tested in the paper.
- If the HT-SR measure correlates with layer importance or depth, it might explain why certain layers benefit from higher rates.
Load-bearing premise
The assumption that the heavy-tailedness measure from the spectral density directly indicates how much to adjust the learning rate for each layer, and that this adjustment remains effective without needing recalibration for different runs or scales.
What would settle it
Running the same model training twice, once with uniform learning rate and once with LLR assigned by HT-SR, and finding no statistically significant difference in final accuracy or training speed would falsify the claim.
Figures
read the original abstract
Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes more balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures ranging from LLaMA to GPT-nano, optimizers including AdamW and Muon, and model scales from 60M to 3B parameters with up to 100B training tokens demonstrate the effectiveness of LLR. LLR achieves up to 1.5x training speedup and consistently outperforms uniform-learning-rate baselines. In particular, it improves the average zero-shot accuracy of 1B models from 47.09% to 49.02%, and that of 3B models from 48.58% to 50.61%. A key advantage of LLR is its low tuning overhead: it can transfer nearly optimal learning-rate settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Layerwise Learning Rate (LLR), an adaptive per-layer learning rate scheme for Transformer-based LLMs. It leverages Heavy-Tailed Self-Regularization (HT-SR) theory to quantify the heavy-tailedness of the empirical spectral density (ESD) of each layer's weight correlation matrix. Layers exhibiting weaker heavy-tailedness receive larger learning rates to accelerate convergence, while those with stronger heavy-tailedness are assigned smaller rates. Experiments span architectures (LLaMA to GPT-nano), optimizers (AdamW, Muon), and scales (60M–1B parameters), reporting up to 1.5× training speedup and an increase in average zero-shot accuracy from 47.09% to 49.02%, with the practical benefit of transferring near-optimal settings from a single uniform-LR baseline run.
Significance. If the HT-SR-guided mapping from layer-wise heavy-tailedness to LR multipliers can be shown to be robust, near-optimal, and superior to generic heterogeneous schedules, the work would offer a principled, low-overhead approach to handling structural heterogeneity in large Transformers. The broad experimental coverage across models, optimizers, and scales, together with public code, provides a reproducible empirical basis that could influence practical LLM training pipelines.
major comments (2)
- [Experiments] Experiments section: No ablation isolates the specific HT-SR-to-LR mapping. Controls such as reversed assignment (stronger heavy-tailed ESD receiving larger LR), random per-layer multipliers sampled from the same range, or a learned mapping are absent. Without them the reported speedups and accuracy gains cannot be attributed to the heavy-tailedness signal rather than heterogeneity alone.
- [Method] Method and abstract: The precise functional form or selection procedure that converts the HT-SR heavy-tailedness metric into a per-layer LR multiplier is not specified in sufficient detail, nor is its stability across random seeds, optimizer choice, or model scale demonstrated. This mapping is load-bearing for the transferability claim.
minor comments (2)
- [Abstract] Abstract: The baseline against which the 47.09% to 49.02% zero-shot accuracy lift is measured should be stated explicitly, along with the precise set of tasks contributing to the average.
- [Results] Figures/tables: Layer-wise ESD heavy-tailedness values and the resulting LR multipliers should be tabulated or plotted for at least one representative model to allow direct inspection of the mapping.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for strengthening the empirical validation and methodological clarity of our work. We address each major comment below and commit to incorporating the suggested improvements in the revised manuscript.
read point-by-point responses
-
Referee: [Experiments] Experiments section: No ablation isolates the specific HT-SR-to-LR mapping. Controls such as reversed assignment (stronger heavy-tailed ESD receiving larger LR), random per-layer multipliers sampled from the same range, or a learned mapping are absent. Without them the reported speedups and accuracy gains cannot be attributed to the heavy-tailedness signal rather than heterogeneity alone.
Authors: We agree that the current experiments do not fully isolate the contribution of the specific HT-SR-guided mapping from the general benefit of using heterogeneous learning rates. In the revision, we will add the requested ablation studies, including a reversed assignment (larger LRs for stronger heavy-tailed layers), random per-layer multipliers sampled from the same range as our method, and, if feasible within compute limits, a comparison against a learned mapping. These additions will allow us to more rigorously attribute the observed speedups and accuracy improvements to the heavy-tailedness signal. revision: yes
-
Referee: [Method] Method and abstract: The precise functional form or selection procedure that converts the HT-SR heavy-tailedness metric into a per-layer LR multiplier is not specified in sufficient detail, nor is its stability across random seeds, optimizer choice, or model scale demonstrated. This mapping is load-bearing for the transferability claim.
Authors: We acknowledge that the manuscript does not provide sufficient detail on the exact functional form or selection procedure used to derive per-layer LR multipliers from the HT-SR heavy-tailedness metric. In the revised version, we will expand the Method section to explicitly describe the computation of the metric from the ESD and the precise mapping rule (including any thresholds or scaling factors). We will also add results showing the stability of this mapping across multiple random seeds, both AdamW and Muon optimizers, and the full range of model scales (60M to 1B). These changes will better support the transferability claim. revision: yes
Circularity Check
No significant circularity; derivation applies external HT-SR measurements independently
full rationale
The paper grounds LLR in HT-SR theory by computing ESD heavy-tailedness on a uniform-LR baseline run and then assigning per-layer multipliers (weaker heavy-tailedness → larger LR). These multipliers are applied in a separate training run whose final accuracy and speedup are measured independently. No equation or step reduces the reported gains to a fit or definition drawn from the same data used for evaluation. The mapping rule is presented as a direct consequence of the cited theory rather than an ansatz fitted to target metrics or a self-citation chain that forbids alternatives. This is the common honest case of a self-contained method whose central claim rests on external measurements.
Axiom & Free-Parameter Ledger
free parameters (1)
- heavy-tailedness-to-LR mapping function
axioms (1)
- domain assumption Heavy-Tailed Self-Regularization theory correctly quantifies the training difficulty of a Transformer layer via the empirical spectral density of its weight correlation matrix.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Layers with weaker heavy-tailedness are assigned larger learning rates... PL_Alpha_Hill... fT(i)=η·(αiT−αminT)/(αmaxT−αminT)(s−1)+1
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
grounded in Heavy-Tailed Self-Regularization (HT-SR) theory... ESD of weight correlation matrices
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.