A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay
Pith reviewed 2026-05-10 17:03 UTC · model grok-4.3
The pith
A self-attentive meta-optimizer dynamically adjusts the learning rate and weight decay of each parameter group based on that group's statistics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MetaAdamW integrates a self-attention mechanism within the optimizer to produce dynamic modulation factors for per-group learning rates and weight decay. A lightweight Transformer encoder processes statistical features from each parameter group to generate these factors. The encoder is trained with a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap, balanced by an extension of homoscedastic uncertainty weighting with task-specific priorities.
What carries the argument
A lightweight Transformer encoder acts as the meta-optimizer: it takes per-group statistics (gradient norms, momentum norms, correlations) as input and outputs modulation factors for learning rates and weight decay.
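As a rough illustration of this mechanism, the sketch below computes three per-group statistics and applies multiplicative modulation factors inside a textbook AdamW step. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `group_features` and `modulated_adamw_step` are hypothetical, the "correlation" feature is assumed to be the cosine similarity between gradient and momentum, and the hand-picked `lr_factor` and `wd_factor` values stand in for whatever the Transformer encoder would emit.

```python
import math


def group_features(grad, momentum, eps=1e-12):
    """Return (gradient norm, momentum norm, gradient-momentum cosine) for one group.

    Assumption: "correlation" is taken to be cosine similarity; the source
    does not give the exact definition.
    """
    g_norm = math.sqrt(sum(g * g for g in grad))
    m_norm = math.sqrt(sum(m * m for m in momentum))
    dot = sum(g * m for g, m in zip(grad, momentum))
    cosine = dot / (g_norm * m_norm + eps)
    return g_norm, m_norm, cosine


def modulated_adamw_step(param, grad, m, v, step, lr, wd,
                         lr_factor=1.0, wd_factor=1.0,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW step for a single group with scaled lr and weight decay.

    lr_factor and wd_factor are hypothetical placeholders for learned
    modulation factors; with both equal to 1.0 this is plain AdamW.
    """
    eff_lr = lr * lr_factor
    eff_wd = wd * wd_factor
    new_param, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(param, grad, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction for the current step (step counts from 1).
        m_hat = m_i / (1 - beta1 ** step)
        v_hat = v_i / (1 - beta2 ** step)
        # Decoupled weight decay: applied to the parameter, not the gradient.
        p = p - eff_lr * (m_hat / (math.sqrt(v_hat) + eps) + eff_wd * p)
        new_param.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_param, new_m, new_v


# Usage: the same gradient with two different (made-up) learning-rate factors.
features = group_features([3.0, 4.0], [3.0, 4.0])
plain, _, _ = modulated_adamw_step([1.0], [0.5], [0.0], [0.0], step=1, lr=0.1, wd=0.0)
boosted, _, _ = modulated_adamw_step([1.0], [0.5], [0.0], [0.0], step=1, lr=0.1, wd=0.0,
                                     lr_factor=2.0)
```

Treating the factors as multipliers on a shared base learning rate and weight decay keeps the baseline recoverable: setting every factor to 1.0 reduces the step to standard AdamW.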
Load-bearing premise
That training a lightweight Transformer on the meta-objective will produce stable and beneficial adjustments to learning rates and weight decay without instability, overfitting, or excessive computational cost.
What would settle it
Observing that MetaAdamW fails to improve or degrades performance on a new task or model architecture compared to AdamW, or that training becomes unstable due to the meta-modulations.
Original abstract
Adaptive optimizers like AdamW apply uniform hyperparameters across all parameter groups, ignoring heterogeneous optimization dynamics across layers and modules. We address this limitation by proposing MetaAdamW, a new optimizer that integrates a self-attention mechanism to dynamically modulate per-group learning rates and weight decay. The modulation factors are produced by a lightweight Transformer encoder that operates on statistical features (gradient norms, momentum norms, correlations) extracted from each parameter group. To train the attention module, we introduce a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap. A key novel contribution is the extension of homoscedastic uncertainty weighting (HUW) with task-specific priorities that directly scale the regularization terms, enabling domain knowledge to guide automatic loss balancing. Extensive experiments on five diverse tasks (time series forecasting on ETT, language modeling on WikiText-2, machine translation on Multi30k, image classification on CIFAR-10, and sentiment analysis on IMDB) demonstrate that MetaAdamW consistently outperforms the standard AdamW baseline in terms of validation loss, accuracy, or perplexity. Depending on the task, MetaAdamW either reduces overall training time (by up to 17.11%) or improves performance (by up to 11.08%) while introducing only moderate overhead; in some cases, it can also mitigate insufficient convergence caused by premature early stopping. Ablation studies validate the effectiveness of each component, including feature versions, grouping strategies, and the proposed priority-injected uncertainty weighting.
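For orientation, the homoscedastic uncertainty weighting that the abstract extends is commonly written, for K loss terms with learnable log-variances, as the first line below. The second line is only one hypothetical reading of "priorities that directly scale the regularization terms": the priority symbols p_i and their placement are assumptions made for illustration, not the paper's stated formula.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{HUW}} &= \sum_{i=1}^{K} \left( \tfrac{1}{2} e^{-s_i}\, \mathcal{L}_i + \tfrac{1}{2} s_i \right),
  \qquad s_i = \log \sigma_i^2 \\
\mathcal{L}_{\mathrm{prio}} &= \sum_{i=1}^{K} \left( \tfrac{1}{2} e^{-s_i}\, \mathcal{L}_i + \tfrac{p_i}{2} s_i \right),
  \qquad p_i > 0 \ \text{(assumed form)}
\end{aligned}
```

Under this assumed form, setting the derivative with respect to s_i to zero gives e^{-s_i} = p_i / L_i for a positive loss term, so a larger priority would yield a larger effective weight on that term.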
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MetaAdamW, an optimizer that augments AdamW with a lightweight Transformer encoder. The encoder takes statistical features (gradient norms, momentum norms, correlations) from parameter groups and outputs per-group modulation factors for learning rate and weight decay. These factors are learned by optimizing a meta-objective that combines gradient alignment, loss decrease, and generalization gap, extended via priority-injected homoscedastic uncertainty weighting. Experiments across five tasks (ETT forecasting, WikiText-2 language modeling, Multi30k translation, CIFAR-10 classification, IMDB sentiment) report consistent gains over AdamW in validation metrics or training time, with ablations on features, grouping, and the weighting scheme.
Significance. If the empirical gains are reproducible and not artifacts of baseline tuning or task-specific overfitting, the work offers a concrete mechanism for group-adaptive hyperparameter modulation inside a single training run. The priority-injected HUW extension and the use of self-attention on per-group statistics are technically interesting. Credit is given for the breadth of the five-task evaluation and the inclusion of ablation studies on grouping strategies and feature sets.
major comments (3)
- [Meta-objective definition] Meta-objective definition (methods section): the objective directly includes loss decrease and generalization gap as additive terms. These quantities are downstream outcomes of the very optimization process whose learning-rate and weight-decay schedules are being modulated by the Transformer; this creates a potential circular dependence that is load-bearing for the meta-training procedure. A concrete diagnostic (e.g., ablation that replaces these terms with fixed targets or an analysis of fixed-point stability) is required.
- [Results section] Experimental claims (abstract and results section): the manuscript asserts consistent outperformance on five tasks but supplies no statistical significance tests, no protocol for hyperparameter search on the AdamW baseline, and no details on the number of random seeds or variance estimates. These omissions are load-bearing for the central empirical claim.
- [Experiments section] Training and evaluation protocol (experiments section): meta-training appears to occur on the same task distributions used for final evaluation. This setup risks the Transformer learning task-specific correlations in the supplied statistical features rather than transferable per-group adaptation rules. An explicit cross-task meta-training experiment or hold-out task would directly test this concern.
minor comments (2)
- [Methods] The precise mathematical definitions of the input statistical features (gradient norms, momentum norms, correlations) and the exact architecture of the lightweight Transformer (number of layers, heads, hidden dimension) should be stated explicitly, ideally with pseudocode, to support reproducibility.
- [Results] Tables reporting performance gains should include the measured computational overhead (wall-clock time or FLOPs) for each task so that the “moderate overhead” claim can be assessed quantitatively.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and indicating revisions where the manuscript will be updated to strengthen the presentation.
-
Referee: [Meta-objective definition] Meta-objective definition (methods section): the objective directly includes loss decrease and generalization gap as additive terms. These quantities are downstream outcomes of the very optimization process whose learning-rate and weight-decay schedules are being modulated by the Transformer; this creates a potential circular dependence that is load-bearing for the meta-training procedure. A concrete diagnostic (e.g., ablation that replaces these terms with fixed targets or an analysis of fixed-point stability) is required.
Authors: We acknowledge the interdependence between the modulated hyperparameters and the meta-objective terms. The meta-training procedure is a form of bilevel optimization in which the inner loop applies the modulated schedules and the outer loop optimizes the Transformer to improve the resulting dynamics; this is not a fixed-point circularity but an explicit optimization over modulation policies. Nevertheless, to directly address the concern, the revised manuscript will include a new ablation that replaces the loss-decrease and generalization-gap terms with fixed scalar targets and reports the resulting modulation stability and downstream performance. revision: yes
-
Referee: [Results section] Experimental claims (abstract and results section): the manuscript asserts consistent outperformance on five tasks but supplies no statistical significance tests, no protocol for hyperparameter search on the AdamW baseline, and no details on the number of random seeds or variance estimates. These omissions are load-bearing for the central empirical claim.
Authors: We agree that rigorous statistical reporting is necessary to support the empirical claims. The revised manuscript will report all main results as means and standard deviations over at least five independent random seeds, detail the grid-search protocol used to tune the AdamW baseline (learning rate and weight-decay ranges), and include paired t-test p-values comparing MetaAdamW against the tuned baseline on each task. revision: yes
-
Referee: [Experiments section] Training and evaluation protocol (experiments section): meta-training appears to occur on the same task distributions used for final evaluation. This setup risks the Transformer learning task-specific correlations in the supplied statistical features rather than transferable per-group adaptation rules. An explicit cross-task meta-training experiment or hold-out task would directly test this concern.
Authors: The input features are deliberately chosen as local, task-agnostic statistics (gradient norms, momentum norms, and pairwise correlations) that reflect per-group optimization dynamics rather than global task properties. Ablation results across five heterogeneous tasks already indicate that the learned modulation rules transfer within each task. A full cross-task meta-training protocol would require substantial additional compute and is outside the current scope; we will add an explicit limitations paragraph discussing this point and outlining it as future work. revision: partial
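The seed-level reporting promised in the second response (means, standard deviations, and a paired t-test against the tuned baseline) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `paired_t_statistic` is hypothetical, and the loss values are placeholders invented for the example, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline, candidate):
    """Return (mean difference, t statistic, degrees of freedom) for paired runs.

    Each position in the two lists is assumed to be the same random seed,
    so the test is run on the per-seed differences.
    """
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    t = mean_diff / (sd / math.sqrt(n))
    return mean_diff, t, n - 1


# Placeholder validation losses for five seeds; NOT results from the paper.
adamw_loss = [0.52, 0.49, 0.51, 0.50, 0.53]
meta_loss = [0.48, 0.47, 0.49, 0.46, 0.50]

mean_diff, t, df = paired_t_statistic(adamw_loss, meta_loss)
summary = {
    "adamw_mean": statistics.mean(adamw_loss),
    "adamw_std": statistics.stdev(adamw_loss),
    "meta_mean": statistics.mean(meta_loss),
    "meta_std": statistics.stdev(meta_loss),
}
```

With five seeds there are 4 degrees of freedom, so the absolute t statistic would be compared against the two-sided 5% critical value of about 2.776; a library routine that returns an exact p-value could replace this manual comparison.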
Circularity Check
No significant circularity in derivation chain
The paper proposes an empirical meta-optimizer (MetaAdamW) whose modulation module is trained via a composite meta-objective on statistical features from parameter groups. Evaluation relies on held-out validation metrics (loss, accuracy, perplexity) across five distinct tasks with ablations on features, grouping, and weighting. No equations reduce by construction to the inputs, no self-citation chains justify uniqueness or load-bearing premises, and no ansatz or known result is renamed as a derivation. The design is self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- task-specific priorities
axioms (1)
- domain assumption: Statistical features (gradient norms, momentum norms, correlations) extracted per parameter group are sufficient for the attention module to produce beneficial modulation factors.
Reference graph
Works this paper leans on
- [1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent, 2016.
- [2] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks, 2017.
- [3] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, 2018.
- [4] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2017.
- [5] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-sgd: Learning to learn quickly for few-shot learning, 2017.
- [6] I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2019.
- [7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2023.
- [8] S. Wang, J. Sun, and Z. Xu. Hyperadam: A learnable task-adaptive adam for network training, 2018.
A. Meta-update algorithm. Algorithm 2 computes the hypothetical parameters θ′ using current Φ, evaluates L_meta, and backpropagates through the attention module.
B. Task-specific hyperparameters. Table 4 reports the optimal hyperparameter configuration for each ta...