Greedy Alignment Principle for Optimizer Selection

Jaerin Lee; Kyoung Mu Lee

arxiv: 2512.06370 · v3 · submitted 2025-12-06 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-17 01:11 UTC · model grok-4.3

keywords: optimizer selection · gradient alignment · momentum optimizer · dynamic hyperparameters · SGD · Adam · causal filter · hyperparameter tuning

The pith

The expected loss drop from an optimizer equals the inner product of its filter with the gradient autocorrelation, allowing greedy selection of momentum rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models optimizers as causal filters that transform gradients into updates. It shows that maximizing the expected loss reduction over a family of such filters is equivalent to finding the filter that has the largest inner product with the observed gradient autocorrelation. This leads to a greedy algorithm for selecting optimizer hyperparameters, with a proven stability bound. Specializing to momentum optimizers produces simple rules for dynamically choosing momentum values in SGD and Adam during training. Experiments confirm these dynamic rules perform as well as or better than the best fixed values found by grid search, across image classification and language model tasks.

Core claim

Optimizer selection is formulated as maximizing the expected drop rate in loss, which equals the inner product between the optimizer filter and the gradient autocorrelation; a greedy optimum exists and remains stable under perturbations of the gradient statistics.
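
To make the claim concrete, here is a short worked expansion for the simplest case. It assumes a scalar causal filter $q[k]$ applied identically to every coordinate, wide-sense stationary gradients, and the sign convention of the passage quoted later on this page, $P(Q;n) := \mathbb{E}[g[n]^\top \dot\theta[n]]$. The symbol $R[k]$ for the lag-$k$ autocorrelation is introduced here for illustration and may differ from the paper's exact (operator-valued) definition.

```latex
\begin{aligned}
\dot\theta[n] &= (q * g)[n] = \sum_{k \ge 0} q[k]\, g[n-k], \\
P(q; n) &:= \mathbb{E}\big[ g[n]^\top \dot\theta[n] \big]
         = \sum_{k \ge 0} q[k]\, \mathbb{E}\big[ g[n]^\top g[n-k] \big]
         = \sum_{k \ge 0} q[k]\, R[k]
         = \langle q, R \rangle, \\
R[k] &:= \mathbb{E}\big[ g[n]^\top g[n-k] \big]
\quad \text{(independent of } n \text{ under stationarity)}.
\end{aligned}
```

The only step that uses stationarity is the last one: without it, the lagged expectation depends on $n$ and the objective is no longer a single inner product.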

What carries the argument

The optimizer modeled as a causal filter whose expected loss contribution is the inner product with gradient autocorrelation.
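
As a sanity check on the filter view, the following Python sketch compares the two sides of the identity on synthetic data. Everything in it is an assumption made for illustration: the scalar AR(1) stream standing in for gradients, the truncated EMA filter `(1 - beta) * beta**k`, and the helper names. It is not taken from the paper's code.

```python
import random

def ar1_gradients(n, rho, seed=0):
    """Synthetic scalar 'gradient' stream: g[t] = rho * g[t-1] + noise."""
    rng = random.Random(seed)
    g, out = 0.0, []
    for _ in range(n):
        g = rho * g + rng.gauss(0.0, 1.0)
        out.append(g)
    return out

def momentum_filter(beta, taps):
    """Impulse response of an EMA momentum rule, truncated to `taps` lags."""
    return [(1.0 - beta) * beta ** k for k in range(taps)]

def autocorrelation(g, taps):
    """Empirical R[k] = mean over n of g[n] * g[n-k], for n >= taps - 1."""
    idx = range(taps - 1, len(g))
    return [sum(g[n] * g[n - k] for n in idx) / len(idx) for k in range(taps)]

def mean_alignment(g, q):
    """Empirical mean of g[n] * update[n], where update[n] = (q * g)[n]."""
    taps = len(q)
    idx = range(taps - 1, len(g))
    total = 0.0
    for n in idx:
        update = sum(q[k] * g[n - k] for k in range(taps))
        total += g[n] * update
    return total / len(idx)

g = ar1_gradients(5000, rho=0.6)
q = momentum_filter(beta=0.9, taps=32)
R = autocorrelation(g, taps=32)

lhs = mean_alignment(g, q)                       # mean of g[n] * (q * g)[n]
rhs = sum(qk * rk for qk, rk in zip(q, R))       # inner product <q, R>
print(lhs, rhs)
```

On a finite sample the two numbers agree to floating-point precision because both are the same double sum taken in a different order. The stationarity assumption is what lets the sample autocorrelation stand in for an expectation that does not depend on n.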

Load-bearing premise

Gradients and updates behave as stationary signals, and the expected loss drop is exactly captured by the inner product between the optimizer filter and gradient autocorrelation.

What would settle it

An experiment where gradient autocorrelation is estimated from early training but then statistics shift dramatically mid-training, checking if the selected momentum causes worse performance than a fixed alternative.
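
A toy version of such a test can be sketched in Python. The setup is entirely hypothetical: a scalar AR(1) stream whose lag-1 correlation flips from 0.8 to -0.8 halfway through, a unit-norm family of truncated momentum filters `beta**k`, and a small grid of candidate values. None of these choices come from the paper, and a real experiment would use actual training gradients.

```python
import random

def ar1_segment(n, rho, rng, g0=0.0):
    """Scalar AR(1) stream starting from state g0."""
    g, out = g0, []
    for _ in range(n):
        g = rho * g + rng.gauss(0.0, 1.0)
        out.append(g)
    return out

def autocorrelation(g, taps):
    """Empirical R[k] = mean over n of g[n] * g[n-k]."""
    idx = range(taps - 1, len(g))
    return [sum(g[n] * g[n - k] for n in idx) / len(idx) for k in range(taps)]

def unit_norm_momentum(beta, taps):
    """Truncated momentum filter beta**k, rescaled to unit l2 norm (assumed family)."""
    raw = [beta ** k for k in range(taps)]
    norm = sum(x * x for x in raw) ** 0.5
    return [x / norm for x in raw]

def score(beta, R):
    """Alignment objective <q_beta, R> for one candidate momentum."""
    q = unit_norm_momentum(beta, len(R))
    return sum(qk * rk for qk, rk in zip(q, R))

rng = random.Random(0)
TAPS = 32
GRID = [0.0, 0.5, 0.8, 0.9, 0.95]

early = ar1_segment(4000, rho=0.8, rng=rng)
late = ar1_segment(4000, rho=-0.8, rng=rng, g0=early[-1])  # statistics shift here

R_early = autocorrelation(early, TAPS)
R_late = autocorrelation(late, TAPS)

beta_early = max(GRID, key=lambda b: score(b, R_early))  # chosen from early data
beta_late = max(GRID, key=lambda b: score(b, R_late))    # what late data prefers

print("selected early:", beta_early, "preferred late:", beta_late)
for b in GRID:
    print(b, round(score(b, R_late), 3))
```

In this toy setting the momentum chosen from early statistics scores worse on the late segment than plain `beta = 0`, which is the failure mode the proposed experiment would look for.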

Figures

Figures reproduced from arXiv: 2512.06370 by Jaerin Lee, Kyoung Mu Lee.

**Figure 1.** Just as optimizers train their models by feeding them parameter velocities θ̇, models can also fit the optimizers to the underlying tasks by feeding gradients g. i.e., the optimal learning power is the convex conjugate of the indicator and also the gauge of the polar, while the conjugate of the gauge is the indicator of the polar. (iii) (Construction): An optimal optimizer Q⋆ ∈ arg max_{Q∈Q} Tr(QΣ) is a subgr…

**Figure 2.** Behavior of optimal optimizers under different types of trust regions. (a, d) Dotted lines are suboptimal optimizers with random Σ in an equal-power Frobenius family; the straight line shows the optimal optimizer found by our theory, achieving fastest convergence. (b, c, e, f) No free lunch theorem: Frobenius family excels for simple elliptic losses, while spectral and diagonal families excel for nonconvex…

**Figure 3.** Demonstration of Corollaries 3.3 and 3.4. Our instantiations of optimal optimizers are compared with baselines having fixed hyperparameters on the CIFAR-100 dataset (Krizhevsky, 2009) with ResNet-18 (He et al., 2016), following the standard settings of He et al. (2016). The error bars indicate the mean and standard deviation over 10 runs. Our instantiation shows better performance than every baseline opti…

**Figure 4.** Demonstration of Corollaries 3.3 and 3.4. Our instantiations of optimal optimizers are compared with baselines having fixed hyperparameters on the CIFAR-100 dataset (Krizhevsky, 2009) with ResNet-18 (He et al., 2016), following the standard settings of He et al. (2016). The line and shaded area indicate the mean and standard deviation over 10 runs. For clear visualization, each baseline plot shows only th…

**Figure 5.** Demonstration of effectiveness of validation-aware design of gradient-based optimizers. The validation-aware optimizers achieve the highest test accuracy among all optimizers. The SGD+M optimizer is trained on the CIFAR-100 dataset (Krizhevsky, 2009) with ResNet-18 (He et al., 2016). where θ̇_tr[n] = (q ∗ g_tr)[n] is the parameter velocity guided solely by the training set, just like how we typically do in m…

Original abstract

Recent works have shown that gradient-update alignment is a powerful signal for modulating optimizer updates, often leading to faster training. We promote this update-wise heuristic as a mathematically grounded principle for selecting and tuning optimizer hyperparameters. By treating gradients and updates as signals and an optimizer as a causal filter that maps between them, we formulate optimizer selection as maximizing the expected drop rate in loss over a prescribed family of optimizers. We show that this objective is exactly the inner product between the optimizer filter and the gradient autocorrelation, and prove that a greedy optimum exists and has a stability bound under perturbations of the estimated gradient statistics. Specializing in momentum-based optimizers, the theory yields simple dynamic momentum selection rules for both SGD+Momentum and Adam/AdamW. Experiments across image classification, language model fine-tuning, and vision transformer fine-tuning show that the resulting dynamic momentum rules match or improve upon the best fixed hyperparameters found via manual sweeps, reducing the need for exhaustive momentum sweeps. Code is available at https://github.com/ironjr/gap

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Derives dynamic momentum rules for SGD and Adam from a greedy alignment objective but depends on a stationarity assumption that may not hold in practice.

The main takeaway is that this paper turns the gradient-update alignment observation into a formal selection principle and extracts concrete dynamic momentum rules for SGD+Momentum and Adam/AdamW directly from an inner-product objective on expected loss drop. They model the optimizer as a causal filter, show the objective equals the inner product with the gradient autocorrelation, and prove a greedy optimum exists along with a stability bound under perturbations of the estimated statistics. The experiments on image classification, language-model fine-tuning, and vision-transformer fine-tuning indicate the resulting rules match or beat the best fixed momentum values located by manual sweeps, which could reduce tuning effort.

That part is new and cleanly executed on paper. The derivation itself looks independent of the final results and gives a reproducible way to arrive at the update rules. The experiments are relevant because they test across different domains rather than a single benchmark.

The soft spot is the stationarity assumption required for the inner-product equivalence. Gradients in deep-network training are rarely wide-sense stationary over any reasonable window, so the exact link to expected loss drop becomes an approximation whose error is not bounded by the given stability result. The paper addresses perturbations of the estimates but does not quantify how much the non-stationarity hurts the dynamic rules in practice. I would want to see the exact online estimation procedure for the autocorrelation and whether performance degrades when the assumption is stressed.

This is aimed at researchers who tune optimizers or want a principled alternative to grid search on momentum. A reader interested in grounding alignment heuristics mathematically will find the derivation useful even if they end up treating the rules as a heuristic.

The formal steps plus the multi-domain experiments are enough to justify sending it to a serious referee rather than desk-rejecting it. I would recommend peer review.
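
For orientation, one plausible shape for an online autocorrelation estimator is sketched below in Python. It is an assumption, not the paper's procedure: the class name `OnlineAutocorrelation`, the exponential-decay averaging, and the bias correction are all hypothetical choices, and gradients are represented as plain lists of floats.

```python
from collections import deque

class OnlineAutocorrelation:
    """Causal running estimate of R[k] = E[g[n] . g[n-k]] for k = 0..max_lag.

    Hypothetical sketch: an exponential moving average of lagged inner
    products, using only gradients seen up to the current step.
    """

    def __init__(self, max_lag, decay=0.99):
        self.decay = decay
        self.history = deque(maxlen=max_lag + 1)  # most recent gradient first
        self.R = [0.0] * (max_lag + 1)            # decayed sums of inner products
        self.weight = [0.0] * (max_lag + 1)       # decayed sample weights per lag

    def update(self, grad):
        # Push the new gradient; the oldest one falls off the right end.
        self.history.appendleft(list(grad))
        for k, past in enumerate(self.history):
            dot = sum(a * b for a, b in zip(grad, past))
            self.R[k] = self.decay * self.R[k] + (1.0 - self.decay) * dot
            self.weight[k] = self.decay * self.weight[k] + (1.0 - self.decay)

    def estimate(self):
        # Dividing by the accumulated weight removes the start-up bias.
        return [r / w if w > 0 else 0.0 for r, w in zip(self.R, self.weight)]

# Usage: a gradient that flips sign every step has R[0] > 0 and R[1] < 0.
tracker = OnlineAutocorrelation(max_lag=2, decay=0.9)
for step in range(200):
    sign = 1.0 if step % 2 == 0 else -1.0
    tracker.update([sign, 2.0 * sign])
print(tracker.estimate())
```

A design cost worth noting: this version stores `max_lag + 1` full gradient copies, which would be expensive for large models. The `decay` parameter also fixes an effective estimation window, which is exactly where the stationarity concern enters.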

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Greedy Alignment Principle (GAP) for optimizer selection. Treating gradients and updates as signals and the optimizer as a causal filter, it formulates hyperparameter choice as maximization of expected loss drop. Under a stationarity assumption this objective equals the inner product between the filter and the gradient autocorrelation function. The paper proves existence of a greedy optimum together with a stability bound under perturbations of the estimated statistics. Specializing to momentum-based methods yields simple dynamic momentum rules for SGD+Momentum and Adam/AdamW. Experiments on image classification, language-model fine-tuning and vision-transformer fine-tuning indicate that the dynamic rules match or exceed the best fixed hyperparameters obtained by manual sweeps.

Significance. If the modeling assumptions hold approximately, the work supplies a mathematically grounded alternative to exhaustive hyperparameter sweeps for momentum. Strengths include the direct derivation of the inner-product objective from expected loss drop, the accompanying stability proof, and the public release of reproducible code. The approach formalizes an existing heuristic and could reduce tuning cost in large-scale training.
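
One way to see why a stability statement of this kind is plausible is the generic Lipschitz argument below. It is a sketch under assumptions introduced here, not the paper's bound: $\mathcal{Q}$ is taken to be a bounded family of filters on which the maximum is attained, with radius $B = \sup_{q \in \mathcal{Q}} \lVert q \rVert$; $R$ is the true autocorrelation, $\hat{R}$ its estimate, and $V$ the best achievable alignment.

```latex
\begin{aligned}
V(R) &:= \max_{q \in \mathcal{Q}} \langle q, R \rangle, \qquad
\hat{q} \in \arg\max_{q \in \mathcal{Q}} \langle q, \hat{R} \rangle, \\
\lvert V(R) - V(\hat{R}) \rvert
  &\le \sup_{q \in \mathcal{Q}} \lvert \langle q, R - \hat{R} \rangle \rvert
  \le B \, \lVert R - \hat{R} \rVert, \\
V(R) - \langle \hat{q}, R \rangle
  &= \big( V(R) - V(\hat{R}) \big) + \langle \hat{q}, \hat{R} - R \rangle
  \le 2 B \, \lVert R - \hat{R} \rVert .
\end{aligned}
```

The last line says that a filter picked from estimated statistics loses at most $2B \lVert R - \hat{R} \rVert$ of alignment relative to the true optimum. This controls estimation error only; it says nothing about drift in $R$ itself.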

major comments (2)

[Derivation of inner-product objective] The equivalence of expected loss drop to the inner product with gradient autocorrelation (derivation section) is derived under the assumption that gradients are wide-sense stationary over the estimation window. This assumption is load-bearing for both the greedy rule and the stability bound. In deep-network training gradient second-order statistics typically shift across epochs and even within epochs; the manuscript should either derive an error bound for the non-stationary case or provide empirical diagnostics of stationarity on the reported datasets.
[Experiments] Experiments section: the claim that dynamic rules reduce the need for exhaustive sweeps rests on comparisons to the best fixed hyperparameters found by manual sweeps. The manuscript does not report the exact sweep ranges, number of independent runs, or whether the dynamic selection was performed online versus on the same data used for the fixed baseline. These controls are necessary to substantiate the practical advantage.

minor comments (2)

[Abstract] Abstract: the phrase 'simple dynamic momentum selection rules' is used without a concrete formula or pseudocode; a brief illustrative equation would improve immediate readability.
[Preliminaries] Notation: the causal-filter representation and the definition of the autocorrelation function would benefit from an early, self-contained equation block before the main derivation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Referee: [Derivation of inner-product objective] The equivalence of expected loss drop to the inner product with gradient autocorrelation (derivation section) is derived under the assumption that gradients are wide-sense stationary over the estimation window. This assumption is load-bearing for both the greedy rule and the stability bound. In deep-network training gradient second-order statistics typically shift across epochs and even within epochs; the manuscript should either derive an error bound for the non-stationary case or provide empirical diagnostics of stationarity on the reported datasets.

Authors: We acknowledge that the wide-sense stationarity assumption is key to equating the expected loss drop to the inner product with the gradient autocorrelation function. Deriving a rigorous error bound for the fully non-stationary case would necessitate substantial additional theoretical development, which we consider outside the primary scope of this work. However, we agree that empirical validation is valuable. In the revised manuscript, we will include diagnostics by estimating the autocorrelation over sliding windows on the datasets used in our experiments (CIFAR-10, GLUE, etc.) and report the variation in second-order statistics to assess the validity of the local stationarity approximation. revision: partial
Referee: [Experiments] Experiments section: the claim that dynamic rules reduce the need for exhaustive sweeps rests on comparisons to the best fixed hyperparameters found by manual sweeps. The manuscript does not report the exact sweep ranges, number of independent runs, or whether the dynamic selection was performed online versus on the same data used for the fixed baseline. These controls are necessary to substantiate the practical advantage.

Authors: We appreciate this observation. The dynamic rules are indeed computed online, using only gradient information available up to the current training step. In the revised version, we will add a detailed description of the experimental protocol, including the exact ranges and grids used for the manual sweeps of fixed momentum values, the number of independent runs (three seeds per configuration), and explicit confirmation that the dynamic selection operates causally without peeking at future data or reusing the validation set for tuning. These additions will clarify the comparison and strengthen the evidence for reduced tuning effort. revision: yes
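
The sliding-window diagnostic mentioned in the first response could look roughly like the Python sketch below. The statistic (normalized lag-1 autocorrelation per window), the window sizes, and the synthetic AR(1) streams are assumptions chosen for illustration; the revised manuscript may report something different.

```python
import random

def ar1_stream(n, rhos, seed=0):
    """Scalar AR(1) stream split into len(rhos) equal segments, one rho per segment."""
    rng = random.Random(seed)
    g, out = 0.0, []
    segment = n // len(rhos)
    for t in range(n):
        rho = rhos[min(t // segment, len(rhos) - 1)]
        g = rho * g + rng.gauss(0.0, 1.0)
        out.append(g)
    return out

def windowed_lag1(g, window, stride):
    """Normalized lag-1 autocorrelation R[1] / R[0] in each sliding window."""
    out = []
    for start in range(0, len(g) - window + 1, stride):
        w = g[start:start + window]
        r0 = sum(x * x for x in w[1:]) / (window - 1)
        r1 = sum(w[n] * w[n - 1] for n in range(1, window)) / (window - 1)
        out.append(r1 / r0)
    return out

def spread(values):
    """Range of the per-window statistic; large values suggest non-stationarity."""
    return max(values) - min(values)

stationary = ar1_stream(6000, rhos=[0.5])
shifted = ar1_stream(6000, rhos=[0.5, -0.5])   # statistics flip halfway through

spread_stationary = spread(windowed_lag1(stationary, window=1000, stride=500))
spread_shifted = spread(windowed_lag1(shifted, window=1000, stride=500))
print(spread_stationary, spread_shifted)
```

A small spread is consistent with local stationarity at that window size, while a large one flags the regime in which the inner-product identity is only an approximation.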

Circularity Check

0 steps flagged

Derivation from expected loss drop to inner-product objective is algebraic and self-contained

The paper begins with an externally motivated objective (maximize expected loss drop over a family of optimizers) and algebraically shows equivalence to the inner product of the causal filter with the gradient autocorrelation under an explicit wide-sense stationarity assumption on gradients. The existence of a greedy optimum and its perturbation stability bound are then proved directly from this inner-product form. No step renames a fitted parameter as a prediction, imports uniqueness via self-citation, or reduces the central claim to its own inputs by construction. The stationarity assumption is stated as a modeling choice whose validity is separate from the algebraic steps; the derivation chain therefore remains independent of experimental outcomes and self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the signal-filter analogy and the existence of a greedy optimum with a stability bound; no free parameters are introduced beyond the prescribed family of optimizers, and no new entities are postulated.

axioms (1)

domain assumption: Gradients and updates can be treated as signals and an optimizer as a causal filter that maps between them.
This modeling choice is invoked to formulate optimizer selection as maximization of expected loss drop and to reach the inner-product equivalence.
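
The filter reading is easy to verify for heavy-ball momentum. The Python sketch below assumes the undamped recursion `m[n] = beta * m[n-1] + g[n]` with zero initial state, which may not match the paper's exact parameterization, and checks that it equals a causal convolution of the gradient stream with the impulse response `beta**k`.

```python
import random

def momentum_recursive(grads, beta):
    """Standard momentum buffer: m[n] = beta * m[n-1] + g[n], with m[-1] = 0."""
    m, out = 0.0, []
    for g in grads:
        m = beta * m + g
        out.append(m)
    return out

def momentum_as_filter(grads, beta):
    """Same quantity written as a causal convolution: sum_k beta**k * g[n-k]."""
    out = []
    for n in range(len(grads)):
        # Only current and past gradients (k = 0..n) enter, so the filter is causal.
        out.append(sum(beta ** k * grads[n - k] for k in range(n + 1)))
    return out

rng = random.Random(0)
grads = [rng.gauss(0.0, 1.0) for _ in range(200)]

recursive = momentum_recursive(grads, beta=0.9)
filtered = momentum_as_filter(grads, beta=0.9)
max_gap = max(abs(a - b) for a, b in zip(recursive, filtered))
print(max_gap)
```

Choosing a momentum value is therefore the same as choosing one member of a one-parameter family of impulse responses, which is what makes the selection problem a search over filters.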

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We show that this objective is exactly the inner product between the optimizer filter and the gradient autocorrelation... P(Q;n) := E[g[n]⊤ θ̇[n]] = ⟨Q, R⟩_H

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: Theorem 3.1 (Optimal dynamic optimizers under convex constraints)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[4]

Most results are obtained on an NVIDIA RTX 4090 GPU, while experiments involving ViT-L/14 are performed on an NVIDIA RTX A6000 GPU
