
arxiv: 2512.06370 · v3 · submitted 2025-12-06 · 💻 cs.LG · stat.ML

Greedy Alignment Principle for Optimizer Selection

Pith reviewed 2026-05-17 01:11 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords optimizer selection · gradient alignment · momentum optimizer · dynamic hyperparameters · SGD · Adam · causal filter · hyperparameter tuning

The pith

The expected loss drop from an optimizer equals the inner product of its filter with the gradient autocorrelation, allowing greedy selection of momentum rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models optimizers as causal filters that transform gradients into updates. It shows that maximizing the expected loss reduction over a family of such filters is equivalent to finding the filter with the largest inner product with the observed gradient autocorrelation. This leads to a greedy algorithm for selecting optimizer hyperparameters, with a proven stability bound. Specializing to momentum optimizers produces simple rules for dynamically choosing momentum values in SGD and Adam during training. Experiments on image classification, language-model fine-tuning, and vision-transformer fine-tuning show that these dynamic rules match or improve upon the best fixed values found by manual sweeps.
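
As a rough illustration of the greedy selection described above, the sketch below scores each candidate momentum by the inner product of its filter with an empirical gradient autocorrelation and keeps the best one. It is not the paper's rule: the helper names (`ema_filter`, `autocorrelation`, `select_momentum`, `ar1_gradients`), the truncation to 20 taps, the unit-norm scaling of each filter, and the synthetic AR(1) gradients are all assumptions made for illustration.

```python
import math
import random


def ema_filter(beta, taps):
    """Truncated momentum filter q[k] proportional to beta**k, with unit L2 norm.

    Unit-norm scaling is an assumption standing in for the paper's
    "prescribed family": without some constraint, the inner product
    could be inflated simply by rescaling the filter.
    """
    q = [beta ** k for k in range(taps)]
    norm = math.sqrt(sum(w * w for w in q))
    return [w / norm for w in q]


def autocorrelation(grads, taps):
    """Empirical R[k]: average of <g[n], g[n-k]> over n, for k = 0..taps-1."""
    R = []
    for k in range(taps):
        total = 0.0
        for n in range(k, len(grads)):
            total += sum(a * b for a, b in zip(grads[n], grads[n - k]))
        R.append(total / (len(grads) - k))
    return R


def select_momentum(grads, candidates, taps=20):
    """Greedy choice: the candidate whose filter best aligns with R."""
    R = autocorrelation(grads, taps)

    def alignment(beta):
        return sum(q * r for q, r in zip(ema_filter(beta, taps), R))

    return max(candidates, key=alignment)


def ar1_gradients(rho, steps, dim, seed=0):
    """Synthetic gradients (hypothetical): g[n] = rho * g[n-1] + noise."""
    rng = random.Random(seed)
    g = [0.0] * dim
    out = []
    for _ in range(steps):
        g = [rho * x + rng.gauss(0.0, 1.0) for x in g]
        out.append(g)
    return out


if __name__ == "__main__":
    candidates = [0.0, 0.5, 0.9]
    for rho in (0.0, 0.9):
        grads = ar1_gradients(rho, steps=3000, dim=4)
        print("rho =", rho, "-> selected beta =", select_momentum(grads, candidates))
```

On this toy data, strongly correlated gradients favor a large momentum and uncorrelated gradients favor none. An online variant would presumably recompute the autocorrelation over a window of recent gradients at each selection step, using only past information.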

Core claim

Optimizer selection is formulated as maximizing the expected drop rate in loss, which equals the inner product between the optimizer filter and the gradient autocorrelation; a greedy optimum exists and remains stable under perturbations of the gradient statistics.
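
A minimal derivation sketch of why the objective reduces to an inner product, written for scalar filter taps q[k] and under the wide-sense stationarity premise stated in this review. The symbol R[k] for the gradient autocorrelation, the first-order loss model, and the choice to absorb the sign and learning rate into q are assumptions of this sketch; the paper's own formulation, which the Figure 1 text suggests is matrix-valued (Tr(QΣ)), may differ.

```latex
\begin{align}
  \dot\theta[n] &= (q * g)[n] = \sum_{k \ge 0} q[k]\, g[n-k]
    && \text{(causal filter applied to gradients)} \\
  \text{drop rate at step } n &\approx g[n]^\top \dot\theta[n]
    && \text{(first-order model, sign absorbed into } q\text{)} \\
  \mathbb{E}\!\left[ g[n]^\top \dot\theta[n] \right]
    &= \sum_{k \ge 0} q[k]\, \mathbb{E}\!\left[ g[n]^\top g[n-k] \right]
     = \sum_{k \ge 0} q[k]\, R[k]
     = \langle q, R \rangle
    && \text{(stationarity: } R[k] \text{ independent of } n\text{)}
\end{align}
```

The last step is where stationarity enters: if the expectation depended on n, the objective would be a different inner product at every step rather than a single one.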

What carries the argument

The optimizer is modeled as a causal filter, and its expected contribution to the loss drop is the inner product of that filter with the gradient autocorrelation.

Load-bearing premise

Gradients and updates behave as stationary signals, and the expected loss drop is exactly captured by the inner product between the optimizer filter and gradient autocorrelation.

What would settle it

An experiment where gradient autocorrelation is estimated from early training but then statistics shift dramatically mid-training, checking if the selected momentum causes worse performance than a fixed alternative.
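
A toy version of such a test can be sketched on synthetic data. The code below generates a scalar "gradient" sequence whose correlation flips sign halfway through, then compares the alignment score of a large momentum against no momentum on each half. Everything here is hypothetical: the AR(1) generator, the coefficients 0.8 and -0.8, the 10-tap unit-norm filters, and the helper names are illustrative assumptions, not the paper's protocol.

```python
import math
import random


def unit_ema(beta, taps):
    """Truncated momentum filter with taps beta**k, scaled to unit L2 norm."""
    q = [beta ** k for k in range(taps)]
    s = math.sqrt(sum(w * w for w in q))
    return [w / s for w in q]


def autocorr(x, taps):
    """Empirical autocorrelation of a scalar sequence at lags 0..taps-1."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) / (len(x) - k)
            for k in range(taps)]


def alignment(beta, x, taps=10):
    """Inner product between the momentum filter and the autocorrelation."""
    return sum(q * r for q, r in zip(unit_ema(beta, taps), autocorr(x, taps)))


def shifted_gradients(steps, rho_early, rho_late, seed=0):
    """AR(1) sequence whose coefficient changes at the midpoint."""
    rng = random.Random(seed)
    g, out = 0.0, []
    for n in range(steps):
        rho = rho_early if n < steps // 2 else rho_late
        g = rho * g + rng.gauss(0.0, 1.0)
        out.append(g)
    return out


grads = shifted_gradients(4000, rho_early=0.8, rho_late=-0.8)
early, late = grads[:2000], grads[2000:]
for beta in (0.0, 0.8):
    print("beta", beta,
          "early", round(alignment(beta, early), 3),
          "late", round(alignment(beta, late), 3))
```

In this construction a momentum chosen from the early statistics is favored before the shift and disfavored after it, which is the failure mode the proposed experiment would probe on real training runs.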

Figures

Figures reproduced from arXiv: 2512.06370 by Jaerin Lee, Kyoung Mu Lee.

Figure 1: Just as optimizers train their models by feeding them parameter velocities θ̇, models can also fit the optimizers to the underlying tasks by feeding gradients g. i.e., the optimal learning power is the convex conjugate of the indicator and also the gauge of the polar, while the conjugate of the gauge is the indicator of the polar. (iii) (Construction): An optimal optimizer Q⋆ ∈ arg max_{Q ∈ 𝒬} Tr(QΣ) is a subgr…

Figure 2: Behavior of optimal optimizers under different types of trust regions. (a, d) Dotted lines are suboptimal optimizers with random Σ in an equal-power Frobenius family; the straight line shows the optimal optimizer found by our theory, achieving fastest convergence. (b, c, e, f) No free lunch theorem: Frobenius family excels for simple elliptic losses, while spectral and diagonal families excel for nonconvex…

Figure 3: Demonstration of Corollaries 3.3 and 3.4. Our instantiations of optimal optimizers are compared with baselines having fixed hyperparameters on the CIFAR-100 dataset (Krizhevsky, 2009) with ResNet-18 (He et al., 2016), following the standard settings of (He et al., 2016). The error bars indicate the mean and standard deviation over 10 runs. Our instantiation shows better performance than every baseline opti…

Figure 4: Demonstration of Corollaries 3.3 and 3.4. Our instantiations of optimal optimizers are compared with baselines having fixed hyperparameters on the CIFAR-100 dataset (Krizhevsky, 2009) with ResNet-18 (He et al., 2016), following the standard settings of (He et al., 2016). The line and shaded area indicate the mean and standard deviation over 10 runs. For clear visualization, each baseline plot shows only th…

Figure 5: Demonstration of effectiveness of validation-aware design of gradient-based optimizers. The validation-aware optimizers achieve the highest test accuracy among all optimizers. The SGD+M optimizer is trained on the CIFAR-100 dataset (Krizhevsky, 2009) with ResNet-18 (He et al., 2016). where θ̇_tr[n] = (q ∗ g_tr)[n] is the parameter velocity guided solely by the training set, just like how we typically do in m…
original abstract

Recent works have shown that gradient-update alignment is a powerful signal for modulating optimizer updates, often leading to faster training. We promote this update-wise heuristic as a mathematically grounded principle for selecting and tuning optimizer hyperparameters. By treating gradients and updates as signals and an optimizer as a causal filter that maps between them, we formulate optimizer selection as maximizing the expected drop rate in loss over a prescribed family of optimizers. We show that this objective is exactly the inner product between the optimizer filter and the gradient autocorrelation, and prove that a greedy optimum exists and has a stability bound under perturbations of the estimated gradient statistics. Specializing in momentum-based optimizers, the theory yields simple dynamic momentum selection rules for both SGD+Momentum and Adam/AdamW. Experiments across image classification, language model fine-tuning, and vision transformer fine-tuning show that the resulting dynamic momentum rules match or improve upon the best fixed hyperparameters found via manual sweeps, reducing the need for exhaustive momentum sweeps. Code is available at https://github.com/ironjr/gap

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Greedy Alignment Principle (GAP) for optimizer selection. Treating gradients and updates as signals and the optimizer as a causal filter, it formulates hyperparameter choice as maximization of expected loss drop. Under a stationarity assumption this objective equals the inner product between the filter and the gradient autocorrelation function. The paper proves existence of a greedy optimum together with a stability bound under perturbations of the estimated statistics. Specializing to momentum-based methods yields simple dynamic momentum rules for SGD+Momentum and Adam/AdamW. Experiments on image classification, language-model fine-tuning and vision-transformer fine-tuning indicate that the dynamic rules match or exceed the best fixed hyperparameters obtained by manual sweeps.

Significance. If the modeling assumptions hold approximately, the work supplies a mathematically grounded alternative to exhaustive hyperparameter sweeps for momentum. Strengths include the direct derivation of the inner-product objective from expected loss drop, the accompanying stability proof, and the public release of reproducible code. The approach formalizes an existing heuristic and could reduce tuning cost in large-scale training.

major comments (2)
  1. [Derivation of inner-product objective] The equivalence of expected loss drop to the inner product with gradient autocorrelation (derivation section) is derived under the assumption that gradients are wide-sense stationary over the estimation window. This assumption is load-bearing for both the greedy rule and the stability bound. In deep-network training gradient second-order statistics typically shift across epochs and even within epochs; the manuscript should either derive an error bound for the non-stationary case or provide empirical diagnostics of stationarity on the reported datasets.
  2. [Experiments] Experiments section: the claim that dynamic rules reduce the need for exhaustive sweeps rests on comparisons to the best fixed hyperparameters found by manual sweeps. The manuscript does not report the exact sweep ranges, number of independent runs, or whether the dynamic selection was performed online versus on the same data used for the fixed baseline. These controls are necessary to substantiate the practical advantage.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'simple dynamic momentum selection rules' is used without a concrete formula or pseudocode; a brief illustrative equation would improve immediate readability.
  2. [Preliminaries] Notation: the causal-filter representation and the definition of the autocorrelation function would benefit from an early, self-contained equation block before the main derivation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each of the major comments below and outline the revisions we plan to make.

point-by-point responses
  1. Referee: [Derivation of inner-product objective] The equivalence of expected loss drop to the inner product with gradient autocorrelation (derivation section) is derived under the assumption that gradients are wide-sense stationary over the estimation window. This assumption is load-bearing for both the greedy rule and the stability bound. In deep-network training gradient second-order statistics typically shift across epochs and even within epochs; the manuscript should either derive an error bound for the non-stationary case or provide empirical diagnostics of stationarity on the reported datasets.

    Authors: We acknowledge that the wide-sense stationarity assumption is key to equating the expected loss drop to the inner product with the gradient autocorrelation function. Deriving a rigorous error bound for the fully non-stationary case would necessitate substantial additional theoretical development, which we consider outside the primary scope of this work. However, we agree that empirical validation is valuable. In the revised manuscript, we will include diagnostics by estimating the autocorrelation over sliding windows on the datasets used in our experiments (CIFAR-10, GLUE, etc.) and report the variation in second-order statistics to assess the validity of the local stationarity approximation. revision: partial

  2. Referee: [Experiments] Experiments section: the claim that dynamic rules reduce the need for exhaustive sweeps rests on comparisons to the best fixed hyperparameters found by manual sweeps. The manuscript does not report the exact sweep ranges, number of independent runs, or whether the dynamic selection was performed online versus on the same data used for the fixed baseline. These controls are necessary to substantiate the practical advantage.

    Authors: We appreciate this observation. The dynamic rules are indeed computed online, using only gradient information available up to the current training step. In the revised version, we will add a detailed description of the experimental protocol, including the exact ranges and grids used for the manual sweeps of fixed momentum values, the number of independent runs (three seeds per configuration), and explicit confirmation that the dynamic selection operates causally without peeking at future data or reusing the validation set for tuning. These additions will clarify the comparison and strengthen the evidence for reduced tuning effort. revision: yes
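
The sliding-window diagnostic mentioned in the first response could look roughly like the sketch below: estimate a lag-1 autocorrelation inside each window and report how much it moves across windows. This is only an illustration under stated assumptions. The gradient is reduced to a scalar sequence (for example a projection onto a fixed direction), the data are synthetic, and the names `lag_corr`, `sliding_lag_corr`, `drift`, and `projected_gradients` are hypothetical.

```python
import random


def lag_corr(x, lag):
    """Normalized sample autocorrelation of a scalar sequence at one lag."""
    num = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
    den = sum(v * v for v in x)
    return num / den


def sliding_lag_corr(x, window, stride, lag=1):
    """Lag correlation per window, using only samples inside that window."""
    return [lag_corr(x[s:s + window], lag)
            for s in range(0, len(x) - window + 1, stride)]


def drift(values):
    """Spread (max minus min) of a statistic across windows."""
    return max(values) - min(values)


def projected_gradients(steps, rho_of_step, seed=0):
    """Synthetic scalar gradients: AR(1) with a step-dependent coefficient."""
    rng = random.Random(seed)
    g, out = 0.0, []
    for n in range(steps):
        g = rho_of_step(n) * g + rng.gauss(0.0, 1.0)
        out.append(g)
    return out


stationary = projected_gradients(4000, lambda n: 0.5)
shifting = projected_gradients(4000, lambda n: 0.8 if n < 2000 else -0.8)

for name, seq in (("stationary", stationary), ("shifting", shifting)):
    per_window = sliding_lag_corr(seq, window=500, stride=500)
    print(name, "drift =", round(drift(per_window), 3))
```

A small drift is consistent with local stationarity over the run; a large one signals that a single autocorrelation estimate would not describe the whole trajectory. Thresholds for "small" and "large" would have to be chosen per task.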

Circularity Check

0 steps flagged

Derivation from expected loss drop to inner-product objective is algebraic and self-contained

full rationale

The paper begins with an externally motivated objective (maximize expected loss drop over a family of optimizers) and algebraically shows equivalence to the inner product of the causal filter with the gradient autocorrelation under an explicit wide-sense stationarity assumption on gradients. The existence of a greedy optimum and its perturbation stability bound are then proved directly from this inner-product form. No step renames a fitted parameter as a prediction, imports uniqueness via self-citation, or reduces the central claim to its own inputs by construction. The stationarity assumption is stated as a modeling choice whose validity is separate from the algebraic steps; the derivation chain therefore remains independent of experimental outcomes and self-referential loops.
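
One generic way to see why a linear objective of this kind tolerates estimation error is the following Cauchy-Schwarz argument. It is a sketch only: the paper's actual bound is not reproduced on this page and may be stated differently or be sharper. The symbols h, B, the estimate \hat{R}, and the assumption that the family \mathcal{Q} is compact (so the maxima exist) are introduced here for illustration.

```latex
\begin{align}
  h(R) &:= \max_{q \in \mathcal{Q}} \langle q, R \rangle,
  \qquad B := \max_{q \in \mathcal{Q}} \lVert q \rVert, \\
  \lvert h(R) - h(\hat{R}) \rvert
    &\le \max_{q \in \mathcal{Q}} \lvert \langle q, R - \hat{R} \rangle \rvert
     \le B \,\lVert R - \hat{R} \rVert, \\
  \langle \hat{q}^\star, R \rangle
    &= \langle \hat{q}^\star, \hat{R} \rangle + \langle \hat{q}^\star, R - \hat{R} \rangle
     \ge h(\hat{R}) - B \,\lVert R - \hat{R} \rVert
     \ge h(R) - 2B \,\lVert R - \hat{R} \rVert,
\end{align}
```

where \hat{q}^\star maximizes \langle q, \hat{R} \rangle over \mathcal{Q}. Under these assumptions, a filter selected from the estimated statistics loses at most 2B times the estimation error relative to the true optimum.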

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the signal-filter analogy and the existence of a greedy optimum with a stability bound; no free parameters are introduced beyond the prescribed family of optimizers, and no new entities are postulated.

axioms (1)
  • domain assumption: Gradients and updates can be treated as signals and an optimizer as a causal filter that maps between them.
    This modeling choice is invoked to formulate optimizer selection as maximization of expected loss drop and to reach the inner-product equivalence.

pith-pipeline@v0.9.0 · 5468 in / 1354 out tokens · 64911 ms · 2026-05-17T01:11:15.522095+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

  4. [4]

    Most results are obtained on an NVIDIA RTX 4090 GPU, while experiments involving ViT-L/14 are performed on an NVIDIA RTX A6000 GPU
