pith. machine review for the scientific record.

arxiv: 2605.02317 · v2 · submitted 2026-05-04 · 💻 cs.AI · cs.LG

Recognition: unknown

Anon: Extrapolating Adaptivity Beyond SGD and Adam

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 16:20 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords adaptive optimizer · SGD · Adam · tunable adaptivity · incremental delay update · convergence analysis · deep learning optimization

The pith

Anon makes adaptivity a continuous real-valued knob that interpolates between SGD and Adam and extends past both.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the fixed positive adaptivity in preconditioners like those in Adam prevents optimizers from fitting diverse loss landscapes well. To remove this restriction, it introduces Anon, whose single scalar adaptivity parameter can be set anywhere on the real line. When the parameter equals zero the method reduces to SGD; when it equals one it reduces to Adam; values outside that interval produce new behaviors. An incremental delay update rule replaces the hard maximum tracking of AMSGrad so that convergence proofs still hold for any real value of the parameter. The authors prove convergence in both convex and non-convex regimes and report consistent gains over existing methods on image classification, diffusion, and language-modeling benchmarks.
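The abstract does not spell out the update rule, but the interpolation claim is easy to picture with an exponent-style preconditioner in which the adaptivity knob γ sets how strongly the second moment rescales the step. The sketch below is one plausible reading under that assumption, not the paper's Anon/IDU algorithm; bias correction, weight decay, and the IDU tracker are all omitted.

```python
import numpy as np

def tunable_adaptivity_step(param, grad, state, lr=1e-3, gamma=1.0,
                            beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of a hypothetical tunable-adaptivity optimizer.

    gamma = 0 makes the denominator a constant (SGD-with-momentum-like),
    gamma = 1 gives an Adam-like sqrt(v) preconditioner, and values
    outside [0, 1] extrapolate.  Illustrative only; not the paper's Anon.
    """
    m = state.setdefault("m", np.zeros_like(param))
    v = state.setdefault("v", np.zeros_like(param))

    m[:] = beta1 * m + (1 - beta1) * grad         # first moment (momentum)
    v[:] = beta2 * v + (1 - beta2) * grad ** 2    # second moment

    # Adaptivity enters as an exponent on the second-moment preconditioner:
    # gamma=0 -> (v+eps)**0 = 1 (no adaptivity); gamma=1 -> sqrt(v+eps) (Adam-like).
    denom = (v + eps) ** (gamma / 2.0) + eps
    param -= lr * m / denom
    return param
```

On this reading, "extrapolating beyond both" simply means choosing γ < 0 or γ > 1 in the same formula.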

Core claim

Anon is an optimizer whose adaptivity coefficient is allowed to take any real value, thereby interpolating between SGD-like and Adam-like regimes and extrapolating beyond them. The method replaces AMSGrad-style hard maximum tracking with an incremental delay update, a mechanism that preserves convergence across the full spectrum of adaptivity values under both convex and non-convex assumptions.

What carries the argument

A continuously tunable, real-valued adaptivity parameter together with an incremental delay update (IDU) that tracks gradient history without hard maxima.
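The IDU rule itself is not reproduced on this page. As a point of contrast only, here is AMSGrad's hard max-tracking next to a generic "move toward the max by a bounded fraction" stand-in; the mixing rate tau and the blend rule below are placeholders, not the paper's {a_n} and β3 parameterization.

```python
import numpy as np

def amsgrad_track(v_hat, v):
    """AMSGrad's hard max-tracking: the preconditioner never decreases."""
    return np.maximum(v_hat, v)

def delayed_track(v_hat, v, tau=0.5):
    """Placeholder for a softer tracker: move toward the running max by a
    bounded, non-negative increment instead of jumping to it.  This only
    illustrates the kind of delayed update IDU aims for; it is NOT the
    paper's IDU rule."""
    return v_hat + tau * (np.maximum(v_hat, v) - v_hat)
```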

If this is right

  • Setting the adaptivity parameter to zero recovers SGD behavior.
  • Setting the adaptivity parameter to one recovers Adam behavior.
  • Values outside the interval [0,1] produce optimizer dynamics not achievable by either SGD or Adam.
  • Convergence guarantees apply uniformly for any real adaptivity value in both convex and non-convex problems.
  • The single unified method outperforms separate state-of-the-art optimizers on image classification, diffusion, and language modeling tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hyperparameter search could shift from selecting an optimizer to selecting one continuous adaptivity value inside a single method.
  • The same tunable mechanism might be applied to other adaptive algorithms such as RMSprop or AdaGrad to create a family of extrapolating optimizers.
  • Negative adaptivity values could be tested as a deliberate anti-adaptive regime for highly noisy or non-stationary training landscapes.

Load-bearing premise

The main reason adaptive methods generalize worse than non-adaptive ones is that their preconditioners cannot adjust to varied optimization landscapes.

What would settle it

A controlled training run in which setting the adaptivity parameter to a negative value produces divergence or markedly worse final accuracy than standard Adam on the same architecture and data.
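A toy stand-in for that test, using the hypothetical exponent-style preconditioner assumed above rather than the paper's actual Anon, is sketched below; the printed losses, not the sketch, would do the settling.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_loss(gamma, steps=2000, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Noisy 1-D quadratic f(x) = 0.5 * x**2 with Gaussian gradient noise,
    optimized with the hypothetical exponent-style preconditioner."""
    x, m, v = 5.0, 0.0, 0.0
    for _ in range(steps):
        g = x + rng.normal()                       # noisy gradient of 0.5*x^2
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        x -= lr * m / ((v + eps) ** (gamma / 2) + eps)
    return 0.5 * x * x

# Anti-adaptive, SGD-like, and Adam-like settings of the adaptivity knob.
for gamma in (-0.5, 0.0, 1.0):
    print(f"gamma={gamma:+.1f}  final loss={final_loss(gamma):.4f}")
```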

Figures

Figures reproduced from arXiv: 2605.02317 by Jiajun Wu, Kaiyan Zhao, Leong Hou U, Shaowu Wu, Steve Drew, Xiaoguang Niu, Yiheng Zhang, Yiming Wang.

Figure 1
Figure 1: Visualizing the Dilemma: the distribution of existing optimizers' adaptivity and the preference of different tasks. Convergence constraints lock existing optimizers in suboptimal regimes. (a) γ = 1.5 (b) γ = 0.5 (c) γ = −0.5
Figure 2
Figure 2: Trajectories of SGDM, Adam, AdaBelief, and Anon. The color change from deep red to deep blue represents the loss from high to low; the loss landscape shown is scaled by Anon. More empirical experiments are shown in Appendix B.2 and D.
Figure 3
Figure 3: Comparison of Adam, AMSGrad, and Anon on a simple convex problem with noise. The hyperparameters follow β1 < √β2 and η(t) = 0.1/√t (Reddi et al., 2019).
Figure 4
Figure 4: Hyperparameter sensitivity analysis on CIFAR-10.
Figure 5
Figure 5: Optimization trajectories of SGDM, Adam, and Anon with varying γ. The gradient from yellow to purple indicates decreasing loss values. Different learning rates are applied to clearly visualize the distinct update directions.
Figure 6
Figure 6: 3D visualization of the benchmark functions (Rosenbrock and Rastrigin).
Figure 7
Figure 7: Optimization trajectory comparison using searched hyperparameters. The grid shows Rastrigin (columns 1, 3) and Rosenbrock (columns 2, 4) functions. Anon (γ ≤ 0) tends to explore flatter regions, while Anon (γ ≥ 1) and adaptive methods converge quickly to sharp minima.
read the original abstract

Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity in R, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Anon, an optimizer with a continuously tunable real-valued adaptivity parameter that interpolates between SGD-like and Adam-like behavior and extrapolates beyond both. It introduces the Incremental Delay Update (IDU) mechanism to ensure convergence across the full adaptivity spectrum, claims theoretical convergence guarantees in convex and non-convex settings, and reports consistent empirical outperformance versus state-of-the-art optimizers on image classification, diffusion, and language modeling tasks.

Significance. If the IDU-based convergence analysis holds for arbitrary real adaptivity values, the work would offer a unified, tunable framework bridging non-adaptive and adaptive optimizers with potential practical value for large-scale training. The empirical results across diverse tasks add support, but significance is conditional on verifying that IDU robustly controls moments outside the conventional [0,1] range.

major comments (1)
  1. [Abstract and theoretical analysis] The claim that IDU guarantees convergence for any real adaptivity parameter (including extrapolation outside [0,1]) is load-bearing for the central novelty, yet the description provided gives no explicit bound on the delayed-update term that would prevent unbounded growth or sign changes in the effective preconditioner. Standard Adam/AMSGrad analyses rely on decay factors in (0,1) for moment control, and it is unclear whether IDU replaces this control without additional assumptions.
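For orientation, the moment control the referee is pointing at is the geometric averaging in Adam's second-moment recursion (standard notation, independent of the paper):

```latex
v_t \;=\; \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^{2}
    \;=\; (1-\beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^{2}
    \;\le\; \max_{i \le t} g_i^{2}
\qquad \text{for } \beta_2 \in (0,1).
```

The bound comes for free from the decay factor; a mechanism that drops or generalizes that decay has to supply an equivalent bound, which is what the comment asks to see made explicit for IDU.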
minor comments (1)
  1. [Abstract] The acronym expansion 'Adaptivity Non-restricted Optimizer with Novel convergence technique' is slightly forced; a cleaner phrasing would improve readability.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential of a tunable adaptivity framework. We address the major comment on the theoretical analysis below.

read point-by-point responses
  1. Referee: [Abstract and theoretical analysis] The claim that IDU guarantees convergence for any real adaptivity parameter (including extrapolation outside [0,1]) is load-bearing for the central novelty, yet the description provided gives no explicit bound on the delayed-update term that would prevent unbounded growth or sign changes in the effective preconditioner. Standard Adam/AMSGrad analyses rely on decay factors in (0,1) for moment control, and it is unclear whether IDU replaces this control without additional assumptions.

    Authors: We thank the referee for this observation. The full convergence analysis (detailed in the appendix and referenced in Section 4) derives explicit bounds on the delayed-update term under IDU. For any real-valued adaptivity parameter, the incremental delay mechanism ensures the effective preconditioner remains positive and bounded by a quantity linear in the squared gradient norm, preventing unbounded growth. Sign changes are precluded because each update accumulates only non-negative increments with a delay scaled to the adaptivity value. This control replaces the (0,1) decay factor with a different but sufficient mechanism; the proof requires only the standard assumptions of bounded gradients and Lipschitz smoothness, with no further restrictions. To improve clarity, we will add a concise summary of these bounds to the main text along with an explicit pointer to the relevant theorem and proof. revision: yes
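Read literally, the rebuttal's two assertions (a preconditioner that stays positive and is bounded linearly in the squared gradient norm, built from non-negative increments) amount to a schematic bound of the form below. This is a transcription for the reader, not a theorem quoted from the paper; the constants c1, c2 and the increments δ_t are placeholders, and G denotes the uniform gradient bound from the bounded-gradient assumption the rebuttal invokes.

```latex
\hat v_t \;=\; \hat v_{t-1} + \delta_t(\gamma), \qquad \delta_t(\gamma) \ge 0,
\qquad 0 \;<\; \hat v_t \;\le\; c_1 + c_2\, G^{2}.
```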

Circularity Check

0 steps flagged

No circularity: theoretical convergence claims rest on an independent analysis of the IDU mechanism

full rationale

The paper derives convergence guarantees for Anon under convex and non-convex settings by introducing the incremental delay update (IDU) mechanism, which is presented as a novel, more flexible alternative to AMSGrad's hard max. No equations or definitions in the provided abstract or description reduce the target result to a fitted parameter, self-referential definition, or self-citation chain; the adaptivity parameter is treated as an exogenous real-valued input, and IDU is analyzed directly for moment control across the spectrum. Empirical performance claims are separate from the theory and do not feed back into the proofs. The derivation chain is therefore self-contained, with the empirical claims assessed against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only the abstract is available; no explicit free parameters, detailed axioms, or invented entities beyond the optimizer name and IDU mechanism are described.

axioms (1)
  • domain assumption: Standard assumptions for convergence in convex and non-convex optimization settings
    Invoked to establish theoretical guarantees for the full adaptivity spectrum
invented entities (2)
  • Anon optimizer (no independent evidence)
    purpose: Provide continuously tunable adaptivity across and beyond SGD and Adam
    Newly proposed framework in the paper
  • Incremental delay update (IDU) (no independent evidence)
    purpose: Ensure convergence with flexible adaptivity and robustness to gradient noise
    Novel mechanism introduced to replace hard max-tracking

pith-pipeline@v0.9.0 · 5536 in / 1341 out tokens · 69293 ms · 2026-05-09T16:20:51.966798+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 15 canonical work pages · 4 internal anchors

  1. [1] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.

  2. [2] Chen, J. and Gu, Q. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763.

  3. [3] Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941.

  4. [4] Symbolic discovery of optimization algorithms (Chen et al., 2023). URL https://arxiv.org/abs/2302.06675. Gokaslan, A. and Cohen, V. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus.

  5. [5] Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

  6. [6] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Muon (Jordan et al., 2024): https://kellerjordan.github.io/posts/muon/.

  7. [7] Liu, H., Li, Z., Hall, D., Liang, P., and Ma, T. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342.

  8. [8] Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.

  9. [9] Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

  10. [10] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

  11. [11] Luo, L., Xiong, Y., Liu, Y., and Sun, X. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.

  12. [12] McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. arXiv preprint arXiv:1002.4908.

  13. [13] Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.

  14. [14] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

  15. [15] Internal anchor, Table 7: Validation loss on OpenWebText. Anon γ=1: 2.937, Anon γ=1.1: 2.927, Anon γ=1.15: 2.932, Adam: 2.934, Lion: 2.992, Muon: 3.092. Notably, Anon with γ = 1.15 exhibits the lowest similarity when compared to Adam, Lion (Chen et al., 2023), and Muon (Jordan et al., 2024), suggesting it discovers a unique solution.

  16. [16] Internal anchor, Table 9: Ablation study on the IDU hyperparameters {a_n} and β3. These results demonstrate that IDU is robust to hyperparameter variations; certain configurations (e.g., β3 = 0.3, a_n = 4n−1) even outperform the default setting (β3 = 0.5, a_n = 2n−1). For a_n = 2n−1, accuracy across β3 = 0.1 / 0.3 / 0.5 / 0.7 / 0.9 is 91.76 / 91.98 / 92.42 / 92.43 / 92.16; a_n = 3n−1: 92.2…

  17. [17] Internal anchor, Theorem 3 in the main paper: for simplicity, the debiasing step is omitted in the theoretical analysis, as in Reddi et al. (2019); the analysis also applies to the debiased version. Lemma G.1 (McMahan & Streeter, …).