A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay

JiangBo Zhao; ZhaoXin Liu

arxiv: 2605.04055 · v1 · submitted 2026-04-10 · 💻 cs.LG


Pith reviewed 2026-05-10 17:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords: meta-optimizer · self-attention · adaptive learning rates · weight decay · parameter groups · meta-learning · uncertainty weighting · optimizer


The pith

A self-attentive meta-optimizer dynamically adjusts the learning rate and weight decay of each parameter group based on that group's statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MetaAdamW to address the limitation of uniform hyperparameters in optimizers like AdamW by using a lightweight Transformer encoder to modulate learning rates and weight decay per parameter group. This modulation relies on extracting features such as gradient norms and momentum from each group and applying self-attention to determine appropriate adjustments. The meta-objective for training this module combines gradient alignment, loss decrease, and generalization gap, enhanced by priority-injected homoscedastic uncertainty weighting. Experiments on tasks including time series forecasting, language modeling, machine translation, image classification, and sentiment analysis show consistent improvements in validation metrics or reduced training time compared to standard AdamW. A sympathetic reader would care because adaptive per-group optimization could lead to faster and more effective training of neural networks without extensive manual hyperparameter tuning.

Core claim

MetaAdamW integrates a self-attention mechanism within the optimizer to produce dynamic modulation factors for per-group learning rates and weight decay. The lightweight Transformer encoder processes statistical features from each parameter group to generate these factors, trained using a meta-learning objective that includes gradient alignment, loss decrease, generalization gap, and an extension of homoscedastic uncertainty weighting with task-specific priorities.

What carries the argument

The lightweight Transformer encoder that acts as a meta-optimizer by taking per-group statistics (gradient norms, momentum norms, correlations) and outputting modulation factors for learning rates and weight decay.
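The pipeline described above (per-group statistics in, modulation factors out) can be sketched in plain Python. This is a simplified stand-in, not the paper's encoder: it uses one attention head with identity query/key/value projections, reads "correlations" as the cosine between a group's gradient and its momentum, and squashes the output into (0, 2) with a scaled sigmoid. Those three choices, and every function name, are assumptions made for illustration.

```python
import math


def group_features(grad, momentum):
    """Return [gradient norm, momentum norm, cosine(grad, momentum)] for one group."""
    g_norm = math.sqrt(sum(x * x for x in grad))
    m_norm = math.sqrt(sum(x * x for x in momentum))
    dot = sum(a * b for a, b in zip(grad, momentum))
    cos = dot / (g_norm * m_norm) if g_norm > 0 and m_norm > 0 else 0.0
    return [g_norm, m_norm, cos]


def self_attention(X):
    """Single-head scaled dot-product self-attention over the rows of X.

    Identity projections are used instead of learned Q/K/V matrices (assumption).
    """
    d = len(X[0])
    attended = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        attended.append([sum(w * v[j] for w, v in zip(weights, X))
                         for j in range(d)])
    return attended


def modulation_factors(X, head):
    """Map attended features to factors in (0, 2); a zero head gives exactly 1.0."""
    return [2.0 / (1.0 + math.exp(-sum(h * a for h, a in zip(head, row))))
            for row in self_attention(X)]
```

Because every group attends to every other group, a factor can depend on how one group's statistics compare with the rest of the model, which is the stated motivation for using attention rather than an independent per-group rule.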

Load-bearing premise

That training a lightweight Transformer on the meta-objective will produce stable and beneficial adjustments to learning rates and weight decay without instability, overfitting, or excessive computational cost.

What would settle it

Observing that MetaAdamW fails to improve or degrades performance on a new task or model architecture compared to AdamW, or that training becomes unstable due to the meta-modulations.

Figures

Figures reproduced from arXiv: 2605.04055 by JiangBo Zhao, ZhaoXin Liu.

**Figure 1.** Figures 1, 2, 3, 4 and 5 show the validation curves for each task. MetaAdamW may converge faster, ultimately perform better, or help avoid premature stopping and achieve better convergence, depending on the task.

From Section 4.2.2, Computational Overhead: The additional cost of MetaAdamW comes from feature extraction, attention forward/backward, and meta-updates. As shown in Table 2, the total training time change varies: i…

**Figure 2.** WikiText-2 validation perplexity over epochs.

**Figure 3.** Multi30k validation perplexity over epochs.

**Figure 5.** IMDB validation accuracy over epochs.

From the surrounding text: … extraction cost. In practice, end-to-end training time increases by 10–30%.

From Section 4.6, Future Work: While MetaAdamW is validated on lightweight models, scaling it to billion-parameter Transformers and more complex tasks remains an important direction for further investigation. The current implementation incorporates a feature gating mechanism, which applies a learnable gate t…

Original abstract

Adaptive optimizers like AdamW apply uniform hyperparameters across all parameter groups, ignoring heterogeneous optimization dynamics across layers and modules. We address this limitation by proposing MetaAdamW, a new optimizer that integrates a self-attention mechanism to dynamically modulate per-group learning rates and weight decay. The modulation factors are produced by a lightweight Transformer encoder that operates on statistical features (gradient norms, momentum norms, correlations) extracted from each parameter group. To train the attention module, we introduce a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap. A key novel contribution is the extension of homoscedastic uncertainty weighting (HUW) with task-specific priorities that directly scale the regularization terms, enabling domain knowledge to guide automatic loss balancing. Extensive experiments on five diverse tasks (time series forecasting on ETT, language modeling on WikiText-2, machine translation on Multi30k, image classification on CIFAR-10, and sentiment analysis on IMDB) demonstrate that MetaAdamW consistently outperforms the standard AdamW baseline in terms of validation loss, accuracy, or perplexity. Depending on the task, MetaAdamW either reduces overall training time (by up to 17.11%) or improves performance (by up to 11.08%) while introducing only moderate overhead; in some cases, it can also mitigate issues of insufficient convergence caused by premature early stopping. Ablation studies validate the effectiveness of each component, including feature versions, grouping strategies, and the proposed priority-injected uncertainty weighting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MetaAdamW uses a small transformer to modulate per-group LR and WD via a meta-objective on gradient stats, but the gains look modest and the setup risks task-specific overfitting rather than general adaptation.


The paper's main move is to replace uniform AdamW hyperparameters with a lightweight transformer that reads gradient norms, momentum, and correlations from each parameter group and outputs per-group scaling factors for learning rate and weight decay. It trains this module with a meta-loss that mixes gradient alignment, observed loss decrease, generalization gap, and an extended homoscedastic uncertainty weighting that accepts task-specific priority scalars. Experiments run on ETT forecasting, WikiText-2, Multi30k translation, CIFAR-10, and IMDB, reporting either lower validation loss or shorter wall time than plain AdamW, with ablations on features and grouping.
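The three ingredients of the meta-loss named above can be written down as a small sketch. The page does not give the paper's exact formulas, so each definition here is an assumption chosen for illustration: alignment is taken as one minus the cosine between a training gradient and a validation gradient, loss decrease as the loss after a hypothetical step minus the loss before it, and the generalization gap as the absolute difference between validation and training loss. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def meta_terms(train_grad, val_grad, loss_before, loss_after,
               train_loss, val_loss):
    """Return the three assumed meta-loss terms; lower is better for each."""
    return {
        # 0.0 when training and validation gradients point the same way.
        "alignment": 1.0 - cosine(train_grad, val_grad),
        # Negative when the hypothetical update reduced the loss.
        "loss_decrease": loss_after - loss_before,
        # Distance between validation and training loss.
        "gen_gap": abs(val_loss - train_loss),
    }


def meta_objective(terms, weights):
    """Weighted sum of the terms; `weights` maps term name to a scalar."""
    return sum(weights[name] * value for name, value in terms.items())
```

In this sketch the weights are fixed scalars; the paper instead balances the terms with its priority-injected uncertainty weighting.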

Referee Report

3 major / 2 minor

Summary. The manuscript proposes MetaAdamW, an optimizer that augments AdamW with a lightweight Transformer encoder. The encoder takes statistical features (gradient norms, momentum norms, correlations) from parameter groups and outputs per-group modulation factors for learning rate and weight decay. These factors are learned by optimizing a meta-objective that combines gradient alignment, loss decrease, and generalization gap, extended via priority-injected homoscedastic uncertainty weighting. Experiments across five tasks (ETT forecasting, WikiText-2 language modeling, Multi30k translation, CIFAR-10 classification, IMDB sentiment) report consistent gains over AdamW in validation metrics or training time, with ablations on features, grouping, and the weighting scheme.

Significance. If the empirical gains are reproducible and not artifacts of baseline tuning or task-specific overfitting, the work offers a concrete mechanism for group-adaptive hyperparameter modulation inside a single training run. The priority-injected HUW extension and the use of self-attention on per-group statistics are technically interesting. Credit is given for the breadth of the five-task evaluation and the inclusion of ablation studies on grouping strategies and feature sets.

major comments (3)

[Meta-objective definition] Meta-objective definition (methods section): the objective directly includes loss decrease and generalization gap as additive terms. These quantities are downstream outcomes of the very optimization process whose learning-rate and weight-decay schedules are being modulated by the Transformer; this creates a potential circular dependence that is load-bearing for the meta-training procedure. A concrete diagnostic (e.g., ablation that replaces these terms with fixed targets or an analysis of fixed-point stability) is required.
[Results section] Experimental claims (abstract and results section): the manuscript asserts consistent outperformance on five tasks but supplies no statistical significance tests, no protocol for hyperparameter search on the AdamW baseline, and no details on the number of random seeds or variance estimates. These omissions are load-bearing for the central empirical claim.
[Experiments section] Training and evaluation protocol (experiments section): meta-training appears to occur on the same task distributions used for final evaluation. This setup risks the Transformer learning task-specific correlations in the supplied statistical features rather than transferable per-group adaptation rules. An explicit cross-task meta-training experiment or hold-out task would directly test this concern.

minor comments (2)

[Methods] The precise mathematical definitions of the input statistical features (gradient norms, momentum norms, correlations) and the exact architecture of the lightweight Transformer (number of layers, heads, hidden dimension) should be stated explicitly, ideally with pseudocode, to support reproducibility.
[Results] Tables reporting performance gains should include the measured computational overhead (wall-clock time or FLOPs) for each task so that the “moderate overhead” claim can be assessed quantitatively.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and indicating revisions where the manuscript will be updated to strengthen the presentation.


Referee: [Meta-objective definition] Meta-objective definition (methods section): the objective directly includes loss decrease and generalization gap as additive terms. These quantities are downstream outcomes of the very optimization process whose learning-rate and weight-decay schedules are being modulated by the Transformer; this creates a potential circular dependence that is load-bearing for the meta-training procedure. A concrete diagnostic (e.g., ablation that replaces these terms with fixed targets or an analysis of fixed-point stability) is required.

Authors: We acknowledge the interdependence between the modulated hyperparameters and the meta-objective terms. The meta-training procedure is a form of bilevel optimization in which the inner loop applies the modulated schedules and the outer loop optimizes the Transformer to improve the resulting dynamics; this is not a fixed-point circularity but an explicit optimization over modulation policies. Nevertheless, to directly address the concern, the revised manuscript will include a new ablation that replaces the loss-decrease and generalization-gap terms with fixed scalar targets and reports the resulting modulation stability and downstream performance. revision: yes
Referee: [Results section] Experimental claims (abstract and results section): the manuscript asserts consistent outperformance on five tasks but supplies no statistical significance tests, no protocol for hyperparameter search on the AdamW baseline, and no details on the number of random seeds or variance estimates. These omissions are load-bearing for the central empirical claim.

Authors: We agree that rigorous statistical reporting is necessary to support the empirical claims. The revised manuscript will report all main results as means and standard deviations over at least five independent random seeds, detail the grid-search protocol used to tune the AdamW baseline (learning rate and weight-decay ranges), and include paired t-test p-values comparing MetaAdamW against the tuned baseline on each task. revision: yes
Referee: [Experiments section] Training and evaluation protocol (experiments section): meta-training appears to occur on the same task distributions used for final evaluation. This setup risks the Transformer learning task-specific correlations in the supplied statistical features rather than transferable per-group adaptation rules. An explicit cross-task meta-training experiment or hold-out task would directly test this concern.

Authors: The input features are deliberately chosen as local, task-agnostic statistics (gradient norms, momentum norms, and pairwise correlations) that reflect per-group optimization dynamics rather than global task properties. Ablation results across five heterogeneous tasks already indicate that the learned modulation rules transfer within each task. A full cross-task meta-training protocol would require substantial additional compute and is outside the current scope; we will add an explicit limitations paragraph discussing this point and outlining it as future work. revision: partial
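The reporting promised in the second response (mean and standard deviation over seeds, plus a paired comparison against the tuned baseline) can be sketched with the Python standard library. The helper names are hypothetical, and the function returns only the t statistic and its degrees of freedom; turning these into a p-value needs a Student-t distribution, which the standard library does not provide (SciPy's `scipy.stats.ttest_rel` is one existing option).

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation of one optimizer's per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(candidate, baseline):
    """Paired t statistic and degrees of freedom for matched per-seed scores.

    Both lists must be ordered by seed so that entry i of each list comes
    from the same random seed.
    """
    diffs = [c - b for c, b in zip(candidate, baseline)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Placeholder numbers for illustration only; they are NOT results from the paper.
baseline_loss = [1.0, 2.0, 3.0, 4.0, 5.0]
candidate_loss = [0.5, 1.0, 2.5, 3.0, 4.5]
t_stat, dof = paired_t_statistic(candidate_loss, baseline_loss)
```

Pairing by seed matters here: it removes seed-to-seed variance that both optimizers share, so a consistent small difference can be detected with only a handful of runs.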

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper proposes an empirical meta-optimizer (MetaAdamW) whose modulation module is trained via a composite meta-objective on statistical features from parameter groups. Evaluation relies on held-out validation metrics (loss, accuracy, perplexity) across five distinct tasks with ablations on features, grouping, and weighting. No equations reduce by construction to the inputs, no self-citation chains justify uniqueness or load-bearing premises, and no ansatz or known result is renamed as a derivation. The design is self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unproven effectiveness of the Transformer encoder in learning stable modulations and on the meta-objective producing useful gradients; these are introduced without first-principles derivation.

free parameters (1)

task-specific priorities
Used to scale regularization terms in the extended homoscedastic uncertainty weighting; values are chosen per task.
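One plausible reading of "priorities that scale the regularization terms" can be sketched as follows; the paper's exact formula is not reproduced on this page, so the form below is an assumption. Writing s_i = log σ_i² for each learnable log-variance, the standard homoscedastic weighting of Kendall et al. uses 0.5·exp(−s_i)·L_i + 0.5·s_i per term, and the sketch multiplies the regularizer 0.5·s_i by a priority p_i. Function names are hypothetical.

```python
import math


def priority_huw(losses, log_vars, priorities):
    """Assumed priority-injected uncertainty weighting.

    Returns sum_i 0.5 * exp(-s_i) * L_i + 0.5 * p_i * s_i, where s_i is a
    learnable log-variance and p_i a fixed per-term priority. With every
    p_i equal to 1.0 this is the standard homoscedastic weighting.
    """
    return sum(0.5 * math.exp(-s) * loss + 0.5 * p * s
               for loss, s, p in zip(losses, log_vars, priorities))


def optimal_log_var(loss, priority):
    """Minimizer of a single term for loss > 0 and priority > 0.

    Setting the derivative -0.5 * exp(-s) * L + 0.5 * p to zero gives
    s* = log(L / p).
    """
    return math.log(loss / priority)


def effective_weight(loss, priority):
    """Weight 0.5 * exp(-s*) placed on the loss at the optimum, i.e. 0.5 * p / L."""
    return 0.5 * math.exp(-optimal_log_var(loss, priority))
```

Under this assumed form, the effective weight at the optimum is proportional to p_i / L_i, so a larger priority keeps more weight on that term even as the uncertainty parameter adapts.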

axioms (1)

domain assumption: Statistical features (gradient norms, momentum norms, correlations) extracted per parameter group are sufficient for the attention module to produce beneficial modulation factors.
Invoked in the design of the feature extraction and attention encoder.

pith-pipeline@v0.9.0 · 5573 in / 1257 out tokens · 56881 ms · 2026-05-10T17:03:45.163219+00:00 · methodology


Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent, 2016.

[2] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks, 2017.

[3] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, 2018.

[4] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2017.

[5] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few-shot learning, 2017.

[6] I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2019.

[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2023.

[8] S. Wang, J. Sun, and Z. Xu. HyperAdam: A learnable task-adaptive Adam for network training, 2018.

Appendix A, Meta-Update Algorithm: Algorithm 2 computes the hypothetical parameters θ′ using current Φ, evaluates L_meta, and backpropagates through the attention module.

Appendix B, Task-Specific Hyperparameters: Table 4 reports the optimal hyperparameter configuration for each ta...
