Greedy Alignment Principle for Optimizer Selection
Pith reviewed 2026-05-17 01:11 UTC · model grok-4.3
The pith
The expected loss drop from an optimizer equals the inner product of its filter with the gradient autocorrelation, allowing greedy selection of momentum rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimizer selection is formulated as maximizing the expected drop rate in loss, which equals the inner product between the optimizer filter and the gradient autocorrelation; a greedy optimum exists and remains stable under perturbations of the gradient statistics.
What carries the argument
The optimizer modeled as a causal filter whose expected loss contribution is the inner product with gradient autocorrelation.
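A small numerical sketch may make this concrete. It is not the paper's rule: the function names are hypothetical, the autocorrelation `R` is assumed to be given as an array indexed by lag, and the momentum taps are normalized to unit energy (`sqrt(1 - beta^2) * beta^k`), a constraint chosen here only so that the maximization is non-trivial.

```python
import numpy as np

def momentum_filter(beta, num_lags):
    """Taps q[k] = sqrt(1 - beta^2) * beta^k for k = 0..num_lags-1.

    Assumption of this sketch: the sqrt(1 - beta^2) factor gives the
    untruncated filter unit l2 norm. The paper's constraint may differ.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return np.sqrt(1.0 - beta**2) * beta ** np.arange(num_lags)

def alignment(beta, R):
    """Inner product <q_beta, R> of the filter with the autocorrelation."""
    R = np.asarray(R, dtype=float)
    return float(momentum_filter(beta, R.shape[0]) @ R)

def greedy_beta(R, betas):
    """Return the candidate beta with the largest alignment, plus all scores."""
    scores = np.array([alignment(b, R) for b in betas])
    return float(betas[int(np.argmax(scores))]), scores

# Toy input: geometric autocorrelation R[k] = rho^k (hypothetical data).
rho = 0.6
R = rho ** np.arange(200)
betas = np.linspace(0.0, 0.95, 20)  # candidate grid with step 0.05
best, scores = greedy_beta(R, betas)
print(best)  # the grid point closest to rho in this toy setting
```

The normalization matters. If the taps were non-negative and summed to one, the lag-0 tap would always win, because |R[k]| never exceeds R[0] for a stationary signal, so the greedy choice would collapse to beta = 0.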
Load-bearing premise
Gradients and updates behave as stationary signals, and the expected loss drop is exactly captured by the inner product between the optimizer filter and gradient autocorrelation.
What would settle it
An experiment where gradient autocorrelation is estimated from early training but then statistics shift dramatically mid-training, checking if the selected momentum causes worse performance than a fixed alternative.
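A toy calculation indicates what such an experiment might reveal. It assumes, purely for illustration, an exponentially decaying gradient autocorrelation $R[k] = \sigma^2 \rho^{k}$ and a momentum filter with unit-energy taps $q_\beta[k] = \sqrt{1-\beta^2}\,\beta^{k}$; neither choice is taken from the paper.

```latex
\begin{align*}
\langle q_\beta, R \rangle
  &= \sigma^2 \sqrt{1-\beta^2} \sum_{k \ge 0} (\beta\rho)^k
   = \sigma^2\, \frac{\sqrt{1-\beta^2}}{1-\beta\rho}, \\
\frac{d}{d\beta} \log \langle q_\beta, R \rangle
  &= -\frac{\beta}{1-\beta^2} + \frac{\rho}{1-\beta\rho} = 0
  \;\Longrightarrow\; \beta^\star = \rho \quad (0 \le \rho < 1), \\
\text{shift } \rho: 0.9 \to -0.5: \quad
\langle q_{0.9}, R \rangle
  &= \sigma^2\, \frac{\sqrt{0.19}}{1.45} \approx 0.30\,\sigma^2,
  \qquad \langle q_{0}, R \rangle = \sigma^2 .
\end{align*}
```

Under these assumptions, a momentum value selected before the shift retains roughly 30% of the alignment that plain SGD (beta = 0) would achieve afterward, which is the kind of gap the proposed experiment would look for.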
Original abstract
Recent works have shown that gradient-update alignment is a powerful signal for modulating optimizer updates, often leading to faster training. We promote this update-wise heuristic as a mathematically grounded principle for selecting and tuning optimizer hyperparameters. By treating gradients and updates as signals and an optimizer as a causal filter that maps between them, we formulate optimizer selection as maximizing the expected drop rate in loss over a prescribed family of optimizers. We show that this objective is exactly the inner product between the optimizer filter and the gradient autocorrelation, and prove that a greedy optimum exists and has a stability bound under perturbations of the estimated gradient statistics. Specializing in momentum-based optimizers, the theory yields simple dynamic momentum selection rules for both SGD+Momentum and Adam/AdamW. Experiments across image classification, language model fine-tuning, and vision transformer fine-tuning show that the resulting dynamic momentum rules match or improve upon the best fixed hyperparameters found via manual sweeps, reducing the need for exhaustive momentum sweeps. Code is available at https://github.com/ironjr/gap
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Greedy Alignment Principle (GAP) for optimizer selection. Treating gradients and updates as signals and the optimizer as a causal filter, it formulates hyperparameter choice as maximization of expected loss drop. Under a stationarity assumption this objective equals the inner product between the filter and the gradient autocorrelation function. The paper proves existence of a greedy optimum together with a stability bound under perturbations of the estimated statistics. Specializing to momentum-based methods yields simple dynamic momentum rules for SGD+Momentum and Adam/AdamW. Experiments on image classification, language-model fine-tuning and vision-transformer fine-tuning indicate that the dynamic rules match or exceed the best fixed hyperparameters obtained by manual sweeps.
Significance. If the modeling assumptions hold approximately, the work supplies a mathematically grounded alternative to exhaustive hyperparameter sweeps for momentum. Strengths include the direct derivation of the inner-product objective from expected loss drop, the accompanying stability proof, and the public release of reproducible code. The approach formalizes an existing heuristic and could reduce tuning cost in large-scale training.
major comments (2)
- [Derivation of inner-product objective] The equivalence of expected loss drop to the inner product with gradient autocorrelation (derivation section) is derived under the assumption that gradients are wide-sense stationary over the estimation window. This assumption is load-bearing for both the greedy rule and the stability bound. In deep-network training gradient second-order statistics typically shift across epochs and even within epochs; the manuscript should either derive an error bound for the non-stationary case or provide empirical diagnostics of stationarity on the reported datasets.
- [Experiments] Experiments section: the claim that dynamic rules reduce the need for exhaustive sweeps rests on comparisons to the best fixed hyperparameters found by manual sweeps. The manuscript does not report the exact sweep ranges, number of independent runs, or whether the dynamic selection was performed online versus on the same data used for the fixed baseline. These controls are necessary to substantiate the practical advantage.
minor comments (2)
- [Abstract] Abstract: the phrase 'simple dynamic momentum selection rules' is used without a concrete formula or pseudocode; a brief illustrative equation would improve immediate readability.
- [Preliminaries] Notation: the causal-filter representation and the definition of the autocorrelation function would benefit from an early, self-contained equation block before the main derivation.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript. We address each of the major comments below and outline the revisions we plan to make.
Referee: [Derivation of inner-product objective] The equivalence of expected loss drop to the inner product with gradient autocorrelation (derivation section) is derived under the assumption that gradients are wide-sense stationary over the estimation window. This assumption is load-bearing for both the greedy rule and the stability bound. In deep-network training gradient second-order statistics typically shift across epochs and even within epochs; the manuscript should either derive an error bound for the non-stationary case or provide empirical diagnostics of stationarity on the reported datasets.
Authors: We acknowledge that the wide-sense stationarity assumption is key to equating the expected loss drop to the inner product with the gradient autocorrelation function. Deriving a rigorous error bound for the fully non-stationary case would necessitate substantial additional theoretical development, which we consider outside the primary scope of this work. However, we agree that empirical validation is valuable. In the revised manuscript, we will include diagnostics by estimating the autocorrelation over sliding windows on the datasets used in our experiments (CIFAR-10, GLUE, etc.) and report the variation in second-order statistics to assess the validity of the local stationarity approximation. revision: partial
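The sliding-window diagnostic described in this response might look roughly like the following. This is a sketch, not the authors' code: the function names, the `window` and `stride` parameters, and the drift summary are hypothetical, and it assumes the per-step gradients have been flattened into the rows of a NumPy array.

```python
import numpy as np

def lagged_autocorr(grads, max_lag):
    """Estimate R[k] = mean_n g[n] . g[n-k] for k = 0..max_lag from a (T, d) array."""
    grads = np.asarray(grads, dtype=float)
    T = grads.shape[0]
    return np.array([
        np.mean(np.sum(grads[k:] * grads[:T - k], axis=1))
        for k in range(max_lag + 1)
    ])

def sliding_autocorr(grads, window, stride, max_lag):
    """Normalized autocorrelation R[k] / R[0] in each window.

    Returns an array of shape (num_windows, max_lag + 1).
    """
    grads = np.asarray(grads, dtype=float)
    rows = []
    for start in range(0, grads.shape[0] - window + 1, stride):
        R = lagged_autocorr(grads[start:start + window], max_lag)
        rows.append(R / max(R[0], 1e-12))  # guard against all-zero windows
    return np.array(rows)

def stationarity_drift(rho):
    """Per-window max deviation from the across-window mean autocorrelation."""
    return np.max(np.abs(rho - rho.mean(axis=0)), axis=1)

# Hypothetical gradient trace: constant sign early, alternating sign late.
early = np.ones((20, 1))
late = np.array([[(-1.0) ** n] for n in range(20)])
rho = sliding_autocorr(np.vstack([early, late]), window=10, stride=10, max_lag=2)
print(rho)
print(stationarity_drift(rho))
```

Small drift values across windows would be consistent with the local stationarity approximation; large values, as in this deliberately non-stationary trace, would indicate that autocorrelation estimated in one window says little about the next.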
Referee: [Experiments] Experiments section: the claim that dynamic rules reduce the need for exhaustive sweeps rests on comparisons to the best fixed hyperparameters found by manual sweeps. The manuscript does not report the exact sweep ranges, number of independent runs, or whether the dynamic selection was performed online versus on the same data used for the fixed baseline. These controls are necessary to substantiate the practical advantage.
Authors: We appreciate this observation. The dynamic rules are indeed computed online, using only gradient information available up to the current training step. In the revised version, we will add a detailed description of the experimental protocol, including the exact ranges and grids used for the manual sweeps of fixed momentum values, the number of independent runs (three seeds per configuration), and explicit confirmation that the dynamic selection operates causally without peeking at future data or reusing the validation set for tuning. These additions will clarify the comparison and strengthen the evidence for reduced tuning effort. revision: yes
Circularity Check
Derivation from expected loss drop to inner-product objective is algebraic and self-contained
The paper begins with an externally motivated objective (maximize expected loss drop over a family of optimizers) and algebraically shows equivalence to the inner product of the causal filter with the gradient autocorrelation under an explicit wide-sense stationarity assumption on gradients. The existence of a greedy optimum and its perturbation stability bound are then proved directly from this inner-product form. No step renames a fitted parameter as a prediction, imports uniqueness via self-citation, or reduces the central claim to its own inputs by construction. The stationarity assumption is stated as a modeling choice whose validity is separate from the algebraic steps; the derivation chain therefore remains independent of experimental outcomes and self-referential loops.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Gradients and updates can be treated as signals and an optimizer as a causal filter that maps between them.
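Under this assumption the central identity can be sketched in a few lines. The notation below is an illustrative simplification (scalar filter taps $q[k]$, learning rate $\eta$, wide-sense stationary gradients $g$) and may differ from the paper's own definitions and sign conventions.

```latex
\begin{align*}
u[n] &= \sum_{k \ge 0} q[k]\, g[n-k],
  \qquad \theta[n+1] = \theta[n] - \eta\, u[n], \\
L(\theta[n]) - L(\theta[n+1]) &\approx \eta\, g[n]^\top u[n]
  \quad \text{(first order in } \eta\text{)}, \\
\mathbb{E}\big[g[n]^\top u[n]\big]
  &= \sum_{k \ge 0} q[k]\, \mathbb{E}\big[g[n]^\top g[n-k]\big]
   = \sum_{k \ge 0} q[k]\, R[k]
   = \langle q, R \rangle .
\end{align*}
```

The last step is where stationarity enters: $R[k] = \mathbb{E}[g[n]^\top g[n-k]]$ must depend only on the lag $k$, not on the step $n$, for the expected loss drop to reduce to a single inner product.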
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We show that this objective is exactly the inner product between the optimizer filter and the gradient autocorrelation... P(Q;n) := E[g[n]⊤ θ̇[n]] = <Q, R>_H"
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Theorem 3.1 (Optimal dynamic optimizers under convex constraints)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.