An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning

Kapil Ahuja; Saurabh Saini; Saurav Kumar; Thomas Wick

arxiv: 2605.21968 · v1 · pith:JKWBK4MSnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 07:22 UTC · model grok-4.3

keywords: adaptive optimizers, deep learning, PID controller, gradient descent, convergence, stability, Adam variants, image classification

The pith

Integrating non-increasing effective learning rates and gradient-difference modulation into AdaPID fixes convergence and stability problems inherited from Adam.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve adaptive optimizers for deep learning by tackling noisy gradients and overshoot in momentum-based methods. It builds on AdaPID, which already combines PID control with adaptivity, but notes that AdaPID still inherits convergence and stability issues from its Adam roots. To address convergence, the authors add a non-increasing effective learning rate mechanism; to address stability, they add a modulation factor driven by successive gradient differences. The resulting IAdaPID-ADG is reported to outperform competing optimizers on both standard image benchmarks and real-world medical datasets, with an ablation on MNIST indicating that each added piece contributes.

Core claim

By grafting the non-increasing effective learning rate schedule originally proposed in AMSGrad together with the gradient-difference modulation factor originally proposed in DiffGrad onto the Adaptive PID (AdaPID) framework, the new IAdaPID-ADG optimizer simultaneously resolves the convergence and stability limitations that AdaPID inherits from Adam. On MNIST, CIFAR-10, IARC and AnnoCerv the combined optimizer produces lower final loss and higher accuracy than Adam, AMSGrad, DiffGrad, AdaPID and other baselines, while the ablation study isolates the contribution of each grafted component.
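For orientation, the two grafted mechanisms can be written in their original Adam-style forms. This is a sketch based on the published AMSGrad and DiffGrad updates, not the IAdaPID-ADG rule itself. The symbols ($g_t$ for the gradient, $m_t$ and $v_t$ for the moment estimates, $\alpha$ for the step size, $\beta_1$, $\beta_2$, $\epsilon$) are assumptions of this note, bias correction is omitted, and the last line is only one plausible way to combine the two pieces without the PID terms, which this page does not specify.

```latex
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
\hat{v}_t &= \max(\hat{v}_{t-1},\, v_t)
  && \text{(AMSGrad: } \alpha/\sqrt{\hat{v}_t} \text{ is non-increasing)}, \\
\xi_t &= \frac{1}{1 + e^{-|g_{t-1} - g_t|}}
  && \text{(DiffGrad: } \xi_t \in [0.5, 1)\text{)}, \\
\theta_{t+1} &= \theta_t - \frac{\alpha\, \xi_t\, m_t}{\sqrt{\hat{v}_t} + \epsilon}
  && \text{(hypothetical combination, no PID terms)}.
\end{align*}
```

Because $\hat{v}_t$ can only grow, the per-coordinate step size $\alpha/(\sqrt{\hat{v}_t}+\epsilon)$ cannot increase. In the DiffGrad form, $\xi_t$ approaches 0.5 when successive gradients are nearly equal and approaches 1 when they differ strongly.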

What carries the argument

The IAdaPID-ADG optimizer formed by grafting a non-increasing effective learning rate and a gradient-difference modulation factor onto the Adaptive PID controller.

If this is right

Training runs reach lower loss values because the effective learning rate cannot increase from one step to the next.
Gradient updates become smoother because the modulation factor scales each step according to how much successive gradients differ.
The same two grafts can be applied to other PID-based or adaptive controllers without changing their core equations.
Ablation results indicate that removing either graft measurably degrades final accuracy on the tested image datasets.
Real-world datasets such as IARC and AnnoCerv exhibit the same ranking of optimizers as the benchmark sets.
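The first two consequences above can be illustrated with a minimal sketch. It assumes an Adam-style base update rather than the actual AdaPID equations, which are not given on this page, so the PID terms are left out. The names `init_state` and `graft_step` are hypothetical, bias correction is omitted, and the default hyper-parameters are the usual Adam defaults, not values from the paper.

```python
import math


def init_state(n):
    """Per-coordinate optimizer state for n scalar parameters."""
    return {"m": [0.0] * n, "v": [0.0] * n, "v_max": [0.0] * n, "g_prev": [0.0] * n}


def graft_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step with an AMSGrad-style max and a DiffGrad-style factor.

    Assumption: this is NOT the IAdaPID-ADG update; it only shows the two grafts.
    Returns the new parameters and the effective learning rate per coordinate.
    """
    new_params, eff_rates = [], []
    for i, (p, g) in enumerate(zip(params, grads)):
        m = beta1 * state["m"][i] + (1 - beta1) * g        # first moment
        v = beta2 * state["v"][i] + (1 - beta2) * g * g    # second moment
        v_max = max(state["v_max"][i], v)                  # AMSGrad: running max
        # DiffGrad: sigmoid of the absolute gradient change, in [0.5, 1]
        xi = 1.0 / (1.0 + math.exp(-abs(state["g_prev"][i] - g)))
        eff = lr / (math.sqrt(v_max) + eps)                # cannot increase over time
        new_params.append(p - eff * xi * m)
        eff_rates.append(eff)
        state["m"][i], state["v"][i] = m, v
        state["v_max"][i], state["g_prev"][i] = v_max, g
    return new_params, eff_rates
```

Calling `graft_step` repeatedly with the gradients of a simple quadratic and recording the returned effective rates gives a per-coordinate sequence that never rises, which is the property the convergence graft is meant to guarantee.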

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grafting strategy could be tested on recurrent or transformer architectures where long-term gradient stability is even more critical.
If the modulation factor can be made adaptive to batch size or network depth, the need for per-dataset hyper-parameter search might decrease further.
Theoretical convergence proofs for the combined update rule would strengthen the empirical results and guide further extensions.
The approach suggests that other control-theoretic ideas beyond PID could be combined with adaptive-rate mechanisms in a similar modular fashion.

Load-bearing premise

That grafting these two specific mechanisms onto AdaPID will correct both convergence and stability shortcomings without creating new instabilities or requiring per-dataset retuning of the added components.

What would settle it

A direct head-to-head training run on CIFAR-10 or a comparable dataset in which IAdaPID-ADG reaches higher final loss or exhibits larger oscillations than plain AdaPID or AMSGrad would falsify the central performance claim.
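One way to make this falsification test concrete is sketched below. The page does not define "final loss" or "oscillations" numerically, so both metrics here are assumptions: the final loss is taken as the mean of the last few recorded values, and oscillation is measured as the total amount by which the loss rises between consecutive steps. All function names are hypothetical.

```python
def final_loss(losses, tail=10):
    """Mean of the last `tail` loss values (all values if fewer are available)."""
    window = losses[-tail:]
    return sum(window) / len(window)


def upward_variation(losses):
    """Sum of all step-to-step increases in a loss curve (0 for a monotone decrease)."""
    return sum(max(0.0, b - a) for a, b in zip(losses, losses[1:]))


def contradicts_claim(candidate, baseline, tail=10):
    """True if the candidate curve ends higher or oscillates more than the baseline.

    Assumption: both curves are non-empty lists of per-epoch training losses
    recorded under the same data, model, and budget.
    """
    return (final_loss(candidate, tail) > final_loss(baseline, tail)
            or upward_variation(candidate) > upward_variation(baseline))
```

In practice the comparison would be repeated over several random seeds, since a single run can differ from another by chance alone.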

Figures

Figures reproduced from arXiv: 2605.21968 by Kapil Ahuja, Saurabh Saini, Saurav Kumar, Thomas Wick.

**Figure 2.** Comparison of AMSGrad, DiffGrad, AdaPID and IAdaPID-ADG optimizers on the CIFAR10 dataset; (a) training loss, …

**Figure 3.** Comparison of DiffGrad and IAdaPID-ADG optimizers on the IARC dataset using (a–c) ResNet50, (d–f) ResNet101, …

**Figure 4.** Comparison of DiffGrad and IAdaPID-ADG optimizers on the AnnoCerv dataset using (a–c) ResNet50, (d–f) ResNet101, …

**Figure 5.** Comparison of AdaPID, AdaPIDAMS, AdaPIDDiff, and IAdaPID-ADG optimizers on the MNIST dataset; (a) training …

Original abstract

Optimization is essential in deep learning. The foundational method upon which most optimizers are built is momentum-based stochastic gradient descent. However, it suffers from two key drawbacks. First, it has noisy and varying gradients, and second, it has an overshoot phenomenon. To address noisy gradients, Adam was proposed, which remains the most widely used adaptive optimizer. To address the overshoot phenomenon, a control-theory-based PID optimizer was proposed. To tackle both the limitations within a single framework, several variants of Adaptive PID (AdaPID) have recently been proposed. Although AdaPID performs well, it still inherits two critical drawbacks from Adam, namely convergence and stability issues. In this work, we address both these limitations. To fix the convergence issue, we uniquely integrate the idea of using a non-increasing effective learning rate into AdaPID (originally proposed in AMSGrad, an extension of Adam). To fix the stability issue, we innovatively integrate a gradient difference based modulation factor into AdaPID (originally proposed in DiffGrad, another extension of Adam). Combining both these ideas in AdaPID, results in our novel IAdaPID-ADG optimizer. We evaluate our proposed optimizer on multiple datasets, including benchmark datasets (MNIST and CIFAR10) and real-world datasets (IARC and AnnoCerv). The IAdaPID-ADG substantially outperforms all competing optimizers. Additionally, we perform an ablation study on the MNIST dataset to demonstrate the contribution of each added component.
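To make the control-theoretic idea in the abstract more tangible, here is a minimal sketch of a textbook discrete PID law that treats the gradient as the error signal. It is a generic illustration under that assumption. The PID optimizer and the AdaPID variants mentioned above may define the integral and derivative terms differently (for example with momentum-style decay), and the names `make_pid_state`, `pid_step`, `kp`, `ki`, and `kd` are hypothetical.

```python
def make_pid_state():
    """State for a scalar PID-style update: running gradient sum and last gradient."""
    return {"integral": 0.0, "prev_grad": 0.0}


def pid_step(param, grad, state, lr=0.1, kp=1.0, ki=0.0, kd=0.0):
    """One generic PID-style parameter update (assumption: not the paper's rule).

    P term: current gradient. I term: sum of past gradients.
    D term: change in gradient since the previous step.
    """
    state["integral"] += grad
    derivative = grad - state["prev_grad"]
    state["prev_grad"] = grad
    control = kp * grad + ki * state["integral"] + kd * derivative
    return param - lr * control
```

With `ki = 0` and `kd = 0` this reduces to plain gradient descent. In this generic form the derivative term reacts to changes in the gradient, which is the ingredient the abstract associates with handling overshoot.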

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper grafts a non-increasing learning rate and gradient-difference modulation onto AdaPID to target convergence and stability, reporting gains across MNIST, CIFAR10 and two medical sets, but the ablation stays limited to MNIST.

The main point is that they have taken the non-increasing effective learning rate from AMSGrad and the gradient-difference modulation from DiffGrad and added both to the AdaPID framework, producing IAdaPID-ADG. The abstract says this fixes the convergence and stability problems that AdaPID still carries from Adam, and the results show it beating the baselines on the four datasets they tried.

Referee Report

2 major / 1 minor

Summary. The paper proposes IAdaPID-ADG, which augments the Adaptive PID (AdaPID) optimizer by integrating a non-increasing effective learning rate (drawn from AMSGrad) to address convergence issues and a gradient-difference-based modulation factor (drawn from DiffGrad) to address stability issues. The authors evaluate the resulting optimizer on MNIST, CIFAR10, IARC, and AnnoCerv, claiming substantial outperformance over competing methods, and include an ablation study on MNIST to show the contribution of each added component.

Significance. If the empirical gains prove robust under proper controls and statistical validation, the work offers a practical extension of control-theoretic optimization ideas by grafting two established mechanisms onto AdaPID. This could be useful for practitioners facing convergence and stability problems in Adam-style adaptive methods, though the significance hinges on whether the combination generalizes without dataset-specific retuning or new instabilities.

major comments (2)

Abstract and Experiments section: The ablation study demonstrating the individual contributions of the non-increasing effective learning rate and the gradient-difference modulation factor is reported only on MNIST. No equivalent breakdown, sensitivity analysis, or component-wise results are described for CIFAR10, IARC, or AnnoCerv. This leaves open whether the two mechanisms interact with the existing PID terms in ways that require per-dataset retuning or degrade performance on the other tasks, which directly bears on the central claim of consistent substantial outperformance across all evaluated datasets.
Experiments section: The manuscript asserts outperformance but supplies no tables with numerical results, error bars, standard deviations, or statistical significance tests for the comparisons against baselines. Without these details it is impossible to determine whether the reported gains are statistically reliable or sensitive to hyper-parameter choices and baseline implementations.

minor comments (1)

The abstract and introduction could more explicitly distinguish the proposed IAdaPID-ADG from the prior AdaPID variants mentioned, including precise citations and a clearer statement of what is novel versus what is directly inherited.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable comments and recommendation for major revision. Below we provide point-by-point responses to the major comments and describe the revisions we intend to make.

Referee: Abstract and Experiments section: The ablation study demonstrating the individual contributions of the non-increasing effective learning rate and the gradient-difference modulation factor is reported only on MNIST. No equivalent breakdown, sensitivity analysis, or component-wise results are described for CIFAR10, IARC, or AnnoCerv. This leaves open whether the two mechanisms interact with the existing PID terms in ways that require per-dataset retuning or degrade performance on the other tasks, which directly bears on the central claim of consistent substantial outperformance across all evaluated datasets.

Authors: We thank the referee for highlighting this limitation. We performed the ablation study on MNIST because it is computationally lightweight and serves as a standard benchmark for isolating the effects of each component. The results on the other datasets (CIFAR10, IARC, AnnoCerv) show that IAdaPID-ADG achieves substantial outperformance, which suggests the mechanisms generalize without obvious negative interactions. Nevertheless, to provide a more complete picture, we will extend the ablation study to include CIFAR10 in the revised manuscript. revision: yes
Referee: Experiments section: The manuscript asserts outperformance but supplies no tables with numerical results, error bars, standard deviations, or statistical significance tests for the comparisons against baselines. Without these details it is impossible to determine whether the reported gains are statistically reliable or sensitive to hyper-parameter choices and baseline implementations.

Authors: We acknowledge the lack of detailed numerical tables and statistical analysis in the current version. The comparisons are shown via plots in the experiments section. To improve clarity and allow assessment of statistical reliability, we will add tables with mean performance metrics and standard deviations from repeated experiments, along with appropriate statistical tests, in the revised manuscript. revision: yes
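The kind of reporting promised here can be sketched with the standard library alone. This is an illustrative outline, not the authors' analysis: it assumes each optimizer has been run with several random seeds and that one accuracy value per seed is available. The function names are hypothetical, and Welch's t-test is chosen here only as one common option for comparing two groups with unequal variances.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) for one optimizer's per-seed scores."""
    return statistics.mean(values), statistics.stdev(values)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Assumption: `a` and `b` each hold at least two per-seed scores.
    """
    se_a = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    se_b = statistics.variance(b) / len(b)   # squared standard error of mean(b)
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1))
    return t, df
```

The returned statistic and degrees of freedom would then be looked up against a t distribution to obtain a p-value. Reporting them next to mean ± standard deviation for each dataset would address the referee's second major comment.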

Circularity Check

0 steps flagged

No circularity: empirical grafting of prior mechanisms with external validation

The paper constructs IAdaPID-ADG by explicitly combining the non-increasing effective learning rate from AMSGrad and the gradient-difference modulation from DiffGrad into the existing AdaPID framework. These steps are presented as integrations of independently published ideas rather than derivations that reduce to the paper's own fitted values or self-citations. Performance claims rest on direct empirical comparisons across MNIST, CIFAR10, IARC, and AnnoCerv, plus an ablation on MNIST, with no equations shown to be equivalent to inputs by construction. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The paper adds no new theoretical entities or derivations; it assembles two previously published mechanisms and validates them empirically. Free parameters include the precise schedule for the non-increasing rate and the scaling of the modulation factor, both inherited from the source papers and presumably tuned on the target datasets.

free parameters (2)

non-increasing effective learning rate schedule
Inherited from AMSGrad; the exact decay form and any additional scaling constants must be chosen or fitted.
gradient-difference modulation factor
Inherited from DiffGrad; the functional form and any weighting hyper-parameter are not specified in the abstract.

axioms (2)

domain assumption: AdaPID still suffers from the convergence and stability problems of Adam
Stated as motivation in the abstract; taken as given from prior literature.
ad hoc to paper: Adding the two mechanisms will jointly resolve both problems
The central design choice of the paper; no proof or counter-example analysis is mentioned.
