Directional Consistency as a Complementary Optimization Signal: The GONO Framework

Victor Daniel Gera

arXiv: 2605.06575 · v1 · submitted 2026-05-07 · cs.LG · cs.AI

Pith reviewed 2026-05-08 12:18 UTC · model grok-4.3

keywords: GONO optimizer · directional consistency · momentum adaptation · Adam optimizer · gradient cosine similarity · convergence rate · oscillation detection · neural network optimization

The pith

GONO adapts Adam's momentum coefficient using cosine similarity of consecutive gradients while preserving the original convergence rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that directional alignment of gradients can occur independently of loss reduction, a signal that standard optimizers do not use. It proposes GONO, which adjusts Adam's beta_1 based on the cosine similarity of successive gradients, boosting momentum during consistent phases and damping it during oscillations. According to the paper's proof, this preserves Adam's O(1/sqrt(T)) convergence rate, and the method reduces exactly to Adam when the signal is uninformative. The approach performs competitively on image datasets, suggesting directional consistency as a practical complement to magnitude-based adaptation.

Core claim

Directional alignment and loss convergence can be decoupled: an optimizer can show near-perfect directional consistency while the loss decreases slowly. Existing optimizers lack mechanisms to exploit temporal consistency in gradient directions. GONO adapts Adam's momentum coefficient beta_1 based on cc_t, the consecutive gradient cosine similarity, amplifying momentum under consistency and suppressing it during oscillation. GONO matches Adam's convergence rate and reduces to Adam when the signal is uninformative.

What carries the argument

The cc_t signal, defined as the cosine similarity between consecutive gradients, used to adapt beta_1 in the GONO optimizer.
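As a minimal sketch of how this signal could be computed, assuming cc_t is the plain cosine similarity of the flattened gradients g_{t-1} and g_t. The function name `consecutive_cosine`, the `eps` guard, and the fallback to 0.0 for a zero gradient are illustrative assumptions, not taken from the paper.

```python
import math


def consecutive_cosine(g_prev, g_curr, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors.

    Returns 0.0 when either gradient is numerically zero. That fallback is
    an assumption of this sketch, chosen so a missing signal reads as neutral.
    """
    dot = sum(a * b for a, b in zip(g_prev, g_curr))
    norm_prev = math.sqrt(sum(a * a for a in g_prev))
    norm_curr = math.sqrt(sum(b * b for b in g_curr))
    denom = norm_prev * norm_curr
    if denom < eps:
        return 0.0
    return dot / denom


# Same direction, opposite direction, and orthogonal gradients.
aligned = consecutive_cosine([1.0, 2.0], [2.0, 4.0])
flipped = consecutive_cosine([1.0, 2.0], [-1.0, -2.0])
orthogonal = consecutive_cosine([1.0, 0.0], [0.0, 3.0])
```

Values near 1 correspond to the consistent regime described above, values near -1 to step-to-step reversal (oscillation), and values near 0 to an uninformative signal.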

Load-bearing premise

That directional consistency measured via consecutive gradient cosine similarity provides an independent and actionable signal for adapting beta_1 that improves or maintains performance without introducing instabilities or violating the conditions needed for the claimed convergence guarantee.

What would settle it

Observing that GONO diverges or fails to achieve the O(1/sqrt(T)) rate on a standard convex optimization problem where Adam succeeds.

Figures

Figures reproduced from arXiv: 2605.06575 by Victor Daniel Gera.

**Figure 1.** Three signals during a 300-epoch training run (Experiment 1).

**Figure 2.** Oscillation detection comparison (Experiment 2A). Red background = actual oscillating…

**Figure 3.** Rosenbrock optimization (Experiment 2B).

**Figure 4.** GONO adaptive behavior on MNIST. Left: Terrain factor (β₁_t/β₁) over training steps. Values above 1.0 indicate momentum boosted above Adam default; values below 1.0 indicate damping. Right: Distribution of cc_t across all training steps; 5.4% of steps (left of red threshold) trigger oscillation damping. 5.5 Experiment 4: CIFAR-10 Classification Setup. Architecture: MLP with layers 3072 → 256 → 128 → 10 (ReL…

**Figure 5.** ResNet-18 on CIFAR-10 (Experiment 5). Left: Training loss. Center: Test accuracy. Right: Gradient agreement signal cc_t. GONO (75.44%) is competitive with AdamW (76.88%) and outperforms SGD-M (66.22%). GONO’s primary advantage is in structured scenarios where cc_t cleanly identifies the regime (oscillation detection F1 = 1.00; Rosenbrock valley traversal with confirmed cc_t < 0 signalling). On standard bench…

Original abstract

We identify and formalize an underexplored phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled. An optimizer can exhibit near-perfect directional consistency (cc_t -> 1, measured via consecutive gradient cosine similarity) while the loss remains high or decreases slowly. This observation reveals that existing optimizers such as Adam, SGD, and RMSprop lack explicit mechanisms to exploit temporal consistency in gradient directions, relying instead on magnitude-based signals that fail to distinguish plateaus, saddle points, and genuine convergence. Motivated by this, we introduce GONO (Gradient-Oriented Norm-Adaptive Optimizer), which adapts Adam's momentum coefficient beta_1 based on cc_t: amplifying momentum under directional consistency and suppressing it during oscillation. We prove GONO matches Adam's O(1/sqrt(T)) convergence rate and reduces exactly to Adam when the signal is uninformative. Empirically, cc_t achieves oscillation detection with F1=1.00 (vs. 0.45 for gradient norm), and GONO remains competitive with AdamW on MNIST (98.15%), CIFAR-10 (43.14%), and ResNet-18 (75.44%), establishing directional alignment as a theoretically grounded, practically actionable optimization signal. Code: https://github.com/victordaniel/gono-optimizer
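The adaptation described in the abstract can be illustrated with a rough sketch. This page does not reproduce the paper's actual rule for mapping cc_t to beta_1, so everything specific here is hypothetical: the piecewise-linear mapping in `adapted_beta1`, the parameters `boost`, `damp`, `beta_low`, and `beta_high`, and the choice to bias-correct the first moment with the running product of the per-step coefficients. The sketch only preserves the stated qualitative behavior: more momentum when cc_t > 0, less when cc_t < 0, and plain Adam when cc_t = 0.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity; 0.0 if either vector is numerically zero."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom > eps else 0.0


def adapted_beta1(cc, beta1=0.9, boost=0.05, damp=0.5,
                  beta_low=0.0, beta_high=0.99):
    """Hypothetical mapping from cc_t to beta_1 (not the paper's rule).

    Scales beta1 by 1 + boost*cc for cc > 0 and by 1 + damp*cc for cc < 0,
    then clips to [beta_low, beta_high]. cc == 0 returns beta1 unchanged.
    """
    factor = 1.0 + (boost if cc > 0 else damp) * cc
    return min(max(beta1 * factor, beta_low), beta_high)


def gono_like_minimize(grad_fn, x0, steps=200, lr=0.05, beta1=0.9,
                       beta2=0.999, eps=1e-8, adapt=True):
    """Adam-style loop with a per-step beta_1; adapt=False gives plain Adam."""
    x = list(x0)
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    g_prev = None
    beta1_prod = 1.0  # running product of per-step beta_1 values
    for t in range(1, steps + 1):
        g = grad_fn(x)
        cc = cosine(g_prev, g) if (adapt and g_prev is not None) else 0.0
        b1 = adapted_beta1(cc, beta1)
        beta1_prod *= b1
        m = [b1 * mi + (1.0 - b1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # With a constant beta_1 the product equals beta1**t, i.e. Adam's
        # usual bias correction.
        m_hat = [mi / (1.0 - beta1_prod) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        x = [xi - lr * mh / (math.sqrt(vh) + eps)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
        g_prev = g
    return x


# Toy usage on an ill-conditioned quadratic f(x) = 0.5 * (x0^2 + 10 * x1^2).
def quad_grad(x):
    return [x[0], 10.0 * x[1]]


def quad_loss(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


start = [3.0, -2.0]
x_adaptive = gono_like_minimize(quad_grad, start)
x_adam = gono_like_minimize(quad_grad, start, adapt=False)
```

Comparing `quad_loss(x_adaptive)` with `quad_loss(x_adam)` on a toy problem like this says nothing about the paper's benchmark numbers; it is only a way to see the mechanism run end to end.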

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GONO adapts Adam via gradient directional consistency, but the convergence proof needs checking for the time-varying beta_1.

The main point of this GONO paper is that it adapts Adam's beta_1 using the cosine similarity between consecutive gradients, boosting momentum along consistent directions and damping it during oscillation, while claiming the same O(1/sqrt(T)) rate as Adam and falling back to the original optimizer when the signal is weak. This is new because prior Adam variants, including those the paper cites, do not use temporal directional consistency as an explicit control signal.

The paper does well by separating directional alignment from loss decrease as a distinct phenomenon and by showing that cc_t detects oscillation with an F1 of 1.00 versus 0.45 for gradient norm. The exact reduction to Adam when the signal is uninformative is a clean safeguard that keeps the method from adding risk in unhelpful regimes.

The soft spot is the convergence argument. Standard Adam proofs lean on a fixed beta_1 to control the moving average and apply bias-correction bounds. Replacing beta_1 with a gradient-dependent function of cc_t introduces a stochastic dependence that could generate new error terms or require bounds on how quickly the coefficient changes. The abstract states that the rate is preserved but does not outline extra assumptions or a modified proof strategy, so the claim rests on whether the full derivation closes that gap.

Experiments stay competitive on MNIST, CIFAR-10, and ResNet-18, yet the CIFAR result at 43 percent points to a modest setup, and no variance or ablation details are given, which limits how much practical gain can be read from the numbers.

This is the sort of targeted optimizer tweak that would interest people who tune training loops or study adaptive methods. It has enough structure and a safe fallback to deserve referee time, mainly to examine the analysis around the adapted beta_1. I would send it for review.

Referee Report

1 major / 3 minor

Summary. The manuscript identifies a decoupling between directional consistency of gradients (via consecutive cosine similarity cc_t) and loss convergence, noting that standard optimizers like Adam lack explicit mechanisms to exploit this. It introduces GONO, which adapts Adam's momentum coefficient β₁ based on cc_t to amplify momentum under consistency and suppress it during oscillations. The authors prove that GONO achieves the same O(1/√T) convergence rate as Adam in non-convex stochastic optimization and reduces exactly to Adam when the consistency signal is uninformative. Empirically, cc_t yields F1=1.00 for oscillation detection, and GONO shows competitive results on MNIST (98.15%), CIFAR-10 (43.14%), and ResNet-18 (75.44%).

Significance. If the convergence guarantee is established for the adaptive, gradient-dependent β₁, the work would be significant for formalizing directional consistency as a complementary, theoretically grounded signal beyond magnitude-based adaptations. The exact reduction to Adam when uninformative is a strength that preserves reliability. The oscillation detection result is compelling, but gains on standard benchmarks are incremental, so broader impact hinges on whether the approach yields advantages on harder problems or provides new insights into saddle/plateau behavior.

major comments (1)

[Convergence analysis] Theory section containing the proof of the O(1/√T) rate: The claim that GONO matches Adam's convergence rate is load-bearing for the central contribution, but the adaptation replaces fixed β₁ with a time-varying β₁_t derived from cc_t = cosine(g_t, g_{t-1}), where both g_t and g_{t-1} are stochastic. Standard Adam analyses rely on constant β₁ < 1 to control bias-correction terms and apply bounding lemmas on the momentum difference. No additional assumptions on the Lipschitz continuity or bounded variation of β₁_t are stated, and the reduction-to-Adam case covers only the uninformative regime. This leaves open whether new error terms arise that could affect the rate; the proof must be extended or the conditions clarified.
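To make the concern concrete, here is a short derivation under the assumption that GONO keeps Adam's first-moment recursion with β₁ replaced by a per-step coefficient, written \beta_{1,t} below. The paper's exact update is not reproduced on this page, so this is a sketch of the generic time-varying case rather than the authors' argument.

```latex
\begin{align*}
m_t &= \beta_{1,t}\, m_{t-1} + (1-\beta_{1,t})\, g_t, \qquad m_0 = 0, \\
m_t &= \sum_{i=1}^{t} (1-\beta_{1,i}) \Bigl(\prod_{j=i+1}^{t} \beta_{1,j}\Bigr) g_i, \\
\sum_{i=1}^{t} (1-\beta_{1,i}) \prod_{j=i+1}^{t} \beta_{1,j}
  &= 1 - \prod_{j=1}^{t} \beta_{1,j}.
\end{align*}
```

With a constant coefficient the last line collapses to 1 - β₁^t, which is the quantity Adam's bias correction divides by. With a time-varying coefficient the weights sum to 1 - ∏ β_{1,j} instead, and because each \beta_{1,j} depends on g_j and g_{j-1}, the weights are themselves random and correlated with the gradients they multiply. That correlation is the step a fixed-β₁ proof does not have to handle.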

minor comments (3)

[Experiments] Experimental section: Reported accuracies lack error bars, number of runs, or statistical tests, and hyperparameter details for the adaptation (e.g., how cc_t thresholds map to β₁) are not fully specified. This limits assessment of whether GONO is reliably competitive.
[Notation] Notation and presentation: The quantity cc_t should be introduced with a numbered equation in the main text (not only the abstract) and its computation (including handling of g_{t-1}) made explicit to avoid ambiguity.
[Experiments] The CIFAR-10 result of 43.14% appears low relative to typical ResNet performance; clarify the exact architecture, training protocol, and whether this is top-1 accuracy under the stated setup.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. The primary concern regarding the convergence analysis is addressed below. We agree that additional rigor is needed and will revise the manuscript accordingly to strengthen the theoretical contribution.

Referee: [Convergence analysis] Theory section containing the proof of the O(1/√T) rate: The claim that GONO matches Adam's convergence rate is load-bearing for the central contribution, but the adaptation replaces fixed β₁ with a time-varying β₁_t derived from cc_t = cosine(g_t, g_{t-1}), where both g_t and g_{t-1} are stochastic. Standard Adam analyses rely on constant β₁ < 1 to control bias-correction terms and apply bounding lemmas on the momentum difference. No additional assumptions on the Lipschitz continuity or bounded variation of β₁_t are stated, and the reduction-to-Adam case covers only the uninformative regime. This leaves open whether new error terms arise that could affect the rate; the proof must be extended or the conditions clarified.

Authors: We thank the referee for this precise observation. We acknowledge that the current proof sketch does not fully detail the handling of stochastic, gradient-dependent β₁_t and that standard Adam analyses assume constant β₁. In the revised manuscript, we will extend the convergence analysis as follows: (1) state the standard assumption of bounded stochastic gradient norms (‖g_t‖ ≤ G); (2) note that cc_t ∈ [-1, 1] implies β₁_t ∈ [β_low, β_high] with β_high < 1, permitting the same bias-correction and momentum-difference bounding lemmas with only minor adjustments; (3) introduce a Lipschitz continuity assumption on the loss to bound the variation of β₁_t and show that the resulting additive error terms remain O(1/T) and do not change the O(1/√T) rate. The exact reduction to Adam when the signal is uninformative is retained as a special case, while the general bound holds uniformly. These clarifications and the extended proof will be added to the theory section.

revision: yes
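One piece of step (2) can be checked directly. Assuming the first-moment recursion m_t = \beta_{1,t} m_{t-1} + (1-\beta_{1,t}) g_t with m_0 = 0 (an assumption, since the paper's exact update is not shown on this page), bounded gradients and a coefficient confined to [β_low, β_high] ⊂ [0, 1) keep the momentum bounded by the same constant G:

```latex
\begin{align*}
\|m_t\| &\le \beta_{1,t}\,\|m_{t-1}\| + (1-\beta_{1,t})\,\|g_t\|
         \le \max\bigl(\|m_{t-1}\|,\, G\bigr), \\
\|m_0\| = 0 \;&\Longrightarrow\; \|m_t\| \le G \quad \text{for all } t \ge 1
\quad \text{(by induction on } t\text{)}.
\end{align*}
```

This covers only the boundedness of m_t, which holds for any coefficient sequence in [0, 1). It does not address the statistical dependence between \beta_{1,t} and g_t, nor the claimed O(1/T) size of the extra error terms, which would still have to come from the extended proof.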

Circularity Check

0 steps flagged

No significant circularity; adaptation rule and convergence claim are self-contained

The GONO framework defines its core adaptation of beta_1 directly from the observable quantity cc_t (consecutive gradient cosine similarity), which is computed from the same stochastic gradients used by any first-order optimizer. The paper explicitly states that the method reduces exactly to Adam when the directional signal is uninformative, making this reduction a deliberate design choice rather than a derived result. The claimed O(1/sqrt(T)) convergence guarantee is presented as a proof that extends Adam's standard analysis, not as a statistical fit or renaming of an input pattern. No self-citations, fitted parameters, or ansatzes are invoked to justify the central claims; the derivation chain relies on the explicit adaptation rule and standard non-convex stochastic optimization assumptions without reducing the target result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the paper relies on standard assumptions underlying Adam's convergence analysis for its proof. Specific details of the adaptation function mapping cc_t to beta_1 (e.g., any scaling or threshold parameters) are not provided and would constitute free parameters if present. No new entities are postulated.

axioms (1)

[standard math] Standard assumptions for Adam-like optimizers, such as bounded gradients or Lipschitz continuity of the loss. Invoked implicitly to claim matching the O(1/sqrt(T)) convergence rate of Adam.

pith-pipeline@v0.9.0 · 5528 in / 1489 out tokens · 42441 ms · 2026-05-08T12:18:19.191081+00:00 · methodology

