Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates

Meng Zhu; Quan Xiao; Weidong Min

arxiv: 2605.20533 · v1 · pith:LCGDKACGnew · submitted 2026-05-19 · 💻 cs.LG


Pith reviewed 2026-05-21 06:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords: hybrid optimization · Ada2MS · second-moment estimates · AdamW · momentum SGD · machine learning optimizers · visual tasks


The pith

Ada2MS uses exponential interpolation between elementwise and global second-moment estimates to transition between AdamW and momentum SGD behaviors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ada2MS as a hybrid optimization algorithm that smoothly blends the characteristics of AdamW and momentum SGD. It does this by continuously mixing elementwise second-moment estimates with global ones using an exponential schedule. This approach seeks to combine AdamW's stability and robustness with the better generalization often seen in momentum methods. On visual tasks, it shows competitive performance under a standard comparison setup. A sympathetic reader would care because it offers a way to potentially reduce the trade-off between stability and generalization without heavy tuning.

Core claim

Ada2MS achieves a smooth transition between AdamW-like behavior and momentum-SGD-like behavior through continuous exponential interpolation between elementwise second-moment estimates and global second-moment estimates, obtaining competitive results on the visual tasks evaluated.

What carries the argument

The exponential mixing mechanism between elementwise second-moment estimates and global second-moment estimates, which enables the smooth behavioral transition from per-parameter adaptation to global scaling.
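The paper's exact mixing formula is not reproduced on this page, so the following is only a minimal sketch of one plausible reading of "exponential interpolation": a geometric (log-linear) blend of the per-coordinate second moment with a single global scalar. The function name `mixed_second_moment`, the weight `lam` in [0, 1], and the choice of the coordinate mean as the global estimate are all assumptions made for illustration, not the authors' definitions.

```python
import numpy as np

def mixed_second_moment(v_elem, lam):
    """Geometric blend of elementwise and global second-moment estimates.

    Hypothetical form for illustration only; not taken from the paper.
    lam = 0 keeps the per-coordinate estimate, lam = 1 uses one global scalar.
    """
    v_elem = np.asarray(v_elem, dtype=float)
    v_glob = v_elem.mean()  # assumed global estimate: mean over all coordinates
    return v_elem ** (1.0 - lam) * v_glob ** lam

v = np.array([1.0, 4.0, 16.0])
v_adam_like = mixed_second_moment(v, 0.0)    # unchanged per-coordinate values
v_global_like = mixed_second_moment(v, 1.0)  # every entry equals the mean, 7.0
v_half = mixed_second_moment(v, 0.5)         # sqrt(v * 7.0), halfway in log space
```

Under this assumed form the blend is linear in log space, which is why intermediate values of `lam` change the per-coordinate scaling smoothly rather than switching between the two estimates.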

Load-bearing premise

A single exponential mixing schedule between elementwise and global second moments will reliably combine the generalization benefits of momentum SGD with the stability of AdamW across the tested visual tasks without introducing new instabilities.

What would settle it

An experiment showing that on the evaluated visual tasks, Ada2MS performs significantly worse than both AdamW and momentum SGD, or requires per-task retuning of the mixing parameter to achieve competitive results.

Figures

Figures reproduced from arXiv: 2605.20533 by Meng Zhu, Quan Xiao, Weidong Min.

Original abstract

Optimization algorithms are core methods by which machine learning models iteratively minimize loss functions, update parameters, learn from data, and improve performance. Momentum SGD and AdamW represent two important optimization paradigms. AdamW produces stable updates and usually has strong robustness across training scenarios, but its generalization performance is sometimes weaker than that of momentum methods. Momentum SGD can often obtain better generalization after careful tuning, but it is more sensitive to gradient-scale variation and hyperparameter settings. To balance the strengths and weaknesses of the two paradigms, this paper proposes Ada2MS, an optimization algorithm that achieves a smooth transition between AdamW-like behavior and momentum-SGD-like behavior through continuous exponential interpolation between elementwise second-moment estimates and global second-moment estimates. On the visual tasks evaluated in this study, Ada2MS obtains competitive results under a unified optimizer-comparison protocol. The code will be released at https://github.com/mengzhu0308/Ada2MS

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Ada2MS defines a new exponential interpolation between elementwise and global second-moment estimates, but the global limit still normalizes by sqrt(v) and does not recover momentum SGD updates.

The main takeaway is that Ada2MS mixes per-parameter second moments with a single global one through an exponential schedule. This specific mixing rule is the concrete new piece, and the authors test it on visual tasks under a shared protocol while planning to release code. That combination of a targeted variant plus reproducibility steps is the useful part here.

They lay out the usual motivation cleanly: AdamW gives stable but sometimes weaker generalization, while momentum SGD can generalize better yet reacts more to gradient scale changes. The interpolation is a direct attempt to thread between those two regimes. The experiments follow a unified comparison setup, which avoids some common cherry-picking problems in optimizer papers. The code release is also a practical plus for anyone who wants to try it.

The soft spot is the claimed endpoint behavior. Even at full global mixing the update remains the momentum term m_t divided by the square root of the global second-moment estimate. Momentum SGD uses plain momentum accumulation with no second-moment normalization at all, so the limit does not actually produce momentum-SGD-like steps. The framing therefore overstates how close the method gets to the unnormalized case. This is a description issue rather than a show-stopper for the algorithm itself, but it should be corrected.

The paper does not appear to rest on circular fitting or hidden parameters beyond the explicit mixing schedule. For readers who run optimizer ablations on vision models this could be worth checking once the numbers and ablations are in hand. It is an incremental but cleanly defined variant rather than a broad advance. I would send it for peer review so the experimental details and the limit description can be examined directly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Ada2MS, a hybrid optimization algorithm that uses continuous exponential interpolation between elementwise second-moment estimates and global second-moment estimates to achieve a smooth transition between AdamW-like and momentum-SGD-like behaviors. It claims competitive performance on visual tasks under a unified optimizer-comparison protocol, with code to be released.

Significance. If the claimed transition can be shown to reliably combine stability and generalization benefits without introducing new instabilities, and if the experimental results are substantiated with quantitative metrics, the method could offer a practical addition to the optimizer toolkit for vision models. The planned code release supports reproducibility.

major comments (2)

[§3] §3 (update rule derivation): the global second-moment limit produces an update of the form m_t / sqrt(v_global) (scaled by momentum), which remains a globally normalized adaptive step. This does not recover the un-normalized momentum accumulation used by momentum SGD, contradicting the central claim of a smooth transition to momentum-SGD-like behavior.
[Experiments] Experimental section: the abstract states competitive results on visual tasks under a unified protocol, yet no quantitative metrics, baseline tables, ablation studies on the mixing parameter, or error bars are supplied. This prevents verification of the performance claims that support the method's practical value.
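The first major comment can be illustrated numerically. The sketch below compares a plain momentum SGD step with a globally normalized step of the form m / sqrt(v_global) when the gradient is rescaled by a constant. It assumes a constant gradient, so that the momentum buffer is approximately c·g and the global second moment is approximately mean((c·g)²); the function names, learning rate, and `eps` are hypothetical and not taken from the paper.

```python
import numpy as np

def momentum_sgd_step(m, lr):
    # Plain momentum SGD: the step is lr * m, with no second-moment normalization.
    return lr * m

def global_normalized_step(m, v_global, lr, eps=1e-8):
    # Global limit described in the report: m divided by sqrt of a scalar v_global.
    return lr * m / (np.sqrt(v_global) + eps)

rng = np.random.default_rng(0)
g = rng.normal(size=5)

def steps_for_scale(c):
    # Steady-state stand-ins under a constant gradient c*g (illustrative assumption).
    m = c * g
    v_global = np.mean((c * g) ** 2)
    return momentum_sgd_step(m, 0.1), global_normalized_step(m, v_global, 0.1)

sgd_1, glob_1 = steps_for_scale(1.0)
sgd_100, glob_100 = steps_for_scale(100.0)

# Ratio of step norms when the gradient is scaled by 100.
sgd_ratio = np.linalg.norm(sgd_100) / np.linalg.norm(sgd_1)     # about 100
glob_ratio = np.linalg.norm(glob_100) / np.linalg.norm(glob_1)  # about 1
```

In this toy setting the momentum SGD step grows with the gradient scale, while the globally normalized step does not, which is the sense in which the global limit remains an adaptive method rather than un-normalized momentum SGD.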

minor comments (2)

[Abstract] Abstract: the unified optimizer-comparison protocol is referenced but not outlined (e.g., hyperparameter search ranges or exact tasks); a short description would aid readers.
[§3] Notation: the exponential mixing parameter is introduced but its symbol is not consistently defined across the method equations and text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to improve clarity and substantiation of the claims.


Referee: [§3] §3 (update rule derivation): the global second-moment limit produces an update of the form m_t / sqrt(v_global) (scaled by momentum), which remains a globally normalized adaptive step. This does not recover the un-normalized momentum accumulation used by momentum SGD, contradicting the central claim of a smooth transition to momentum-SGD-like behavior.

Authors: We appreciate this careful examination of the limiting case. The referee is correct that setting the mixing to produce a purely global second-moment estimate yields an update of the form m_t / sqrt(v_global), which applies a global normalization rather than the un-normalized accumulation of standard momentum SGD. This means the transition is to a globally scaled momentum update rather than exactly recovering the classical momentum-SGD rule. We will revise Section 3, the abstract, and the introduction to describe the limiting behavior more precisely as interpolation between AdamW and a globally normalized momentum update. We will also add a short discussion of how this global normalization can still confer generalization benefits similar to momentum SGD in vision tasks when the global scale is appropriately tuned.

Revision: yes

Referee: [Experiments] Experimental section: the abstract states competitive results on visual tasks under a unified protocol, yet no quantitative metrics, baseline tables, ablation studies on the mixing parameter, or error bars are supplied. This prevents verification of the performance claims that support the method's practical value.

Authors: We agree that the experimental presentation is currently insufficient for independent verification. In the revised version we will expand the experimental section to include: (i) full baseline comparison tables reporting quantitative metrics (top-1 accuracy, training loss) on the visual tasks, (ii) ablation studies that vary the exponential mixing parameter across a range of values and show the resulting performance curve, and (iii) error bars computed from at least three independent runs with different random seeds. All experiments will continue to follow the unified optimizer-comparison protocol described in the manuscript. These additions will directly support the competitiveness claims.

Revision: yes
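The reporting format promised in the second response (a mean with error bars over at least three seeds, for each value of the mixing parameter) could be aggregated along the following lines. This is a sketch only: the helper `summarize_runs`, the setting labels, and the numbers are placeholders chosen for illustration and are not results from the paper.

```python
import statistics

def summarize_runs(runs_by_setting):
    """Map each setting to (mean, sample standard deviation) over its seeds."""
    summary = {}
    for setting, scores in runs_by_setting.items():
        if len(scores) < 3:
            raise ValueError(f"{setting}: need at least three seeds, got {len(scores)}")
        summary[setting] = (statistics.mean(scores), statistics.stdev(scores))
    return summary

# Placeholder values only, NOT results from the paper.
placeholder = {
    "lam=0.0": [0.50, 0.60, 0.70],
    "lam=1.0": [0.40, 0.50, 0.60],
}
summary = summarize_runs(placeholder)
```

Using the sample standard deviation (`statistics.stdev`) rather than the population version is the usual choice when only a handful of seeds are available.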

Circularity Check

0 steps flagged

No circularity: Ada2MS is defined directly as an interpolation rule


The paper defines Ada2MS explicitly as an exponential mixing rule between elementwise and global second-moment estimates to interpolate between AdamW-like and momentum-SGD-like updates. This is a constructive algorithmic proposal rather than a derivation chain that reduces claimed behavior or performance to fitted inputs, self-citations, or prior results by construction. Empirical results on visual tasks are presented as outcomes of this definition under a unified protocol, with no equations shown that force the transition or results tautologically. The derivation is self-contained as a new optimizer design.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate hyperparameters or background assumptions, but the method necessarily depends on at least one mixing control parameter whose value is not derived from first principles.

free parameters (1)

exponential mixing parameter
Controls the continuous interpolation weight between elementwise and global second-moment estimates; its selection is required for the claimed smooth transition but is not derived in the abstract.

pith-pipeline@v0.9.0 · 5696 in / 1247 out tokens · 33651 ms · 2026-05-21T06:53:06.882416+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "Ada2MS performs smooth interpolation between AdamW-like behavior and momentum-SGD-like behavior through exponential interpolation between elementwise second-moment estimates and global second-moment estimates"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

