signADAM: Learning Confidences for Deep Neural Networks
Pith reviewed 2026-05-24 18:29 UTC · model grok-4.3
The pith
signADAM adds the sign of stochastic gradients to Adam and uses a confidence function to sparsify updates for faster neural network training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that inserting the sign operation into Adam, then modulating updates with a learned confidence that separates useful from noisy gradient entries, yields both theoretical convergence and empirical improvements in training deep networks by generating sparser, more directed steps.
What carries the argument
The confidence function that scores individual gradient components to generate sparsity while preserving the sign and adaptive-rate structure of Adam.
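As a rough illustration of the mechanism described above, the following NumPy sketch masks low-magnitude gradient entries and feeds the sign of the surviving entries into an Adam-style first-moment average. It is a minimal sketch under stated assumptions, not the paper's algorithm: the names `confidence_mask` and `sign_momentum_step`, the quantile-based threshold `keep_fraction`, and the omission of Adam's second-moment scaling and bias correction are all assumptions made for illustration, since the source does not state the actual form of the confidence function.

```python
import numpy as np


def confidence_mask(grad, keep_fraction=0.5):
    """Hypothetical confidence function (an assumption, not the paper's).

    Marks the largest-magnitude entries of `grad` with 1 and the rest with 0.
    """
    magnitudes = np.abs(grad)
    # Entries at or above this quantile of |grad| are treated as "confident".
    threshold = np.quantile(magnitudes, 1.0 - keep_fraction)
    return (magnitudes >= threshold).astype(grad.dtype)


def sign_momentum_step(param, grad, momentum, lr=1e-3, beta=0.9,
                       keep_fraction=0.5):
    """One simplified sign-based update with a sparsifying mask."""
    mask = confidence_mask(grad, keep_fraction)
    # Keep only the sign of the gradient, zeroing low-confidence entries.
    sparse_sign = np.sign(grad) * mask
    # Exponential moving average of the sparse sign vector.
    momentum = beta * momentum + (1.0 - beta) * sparse_sign
    param = param - lr * momentum
    return param, momentum
```

Using a magnitude quantile as the confidence score is only one plausible choice; it makes the sparsity level an explicit knob, which is convenient when checking whether sparsity itself is responsible for any observed gain.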
If this is right
- The algorithms converge at rates comparable to or better than Adam under standard smoothness assumptions.
- Sparser gradients reduce the impact of noisy components and improve final test performance on vision tasks.
- An adaptive extension based on loss-landscape analysis further boosts generalization.
- Both variants remain first-order and require only minor code changes from existing Adam implementations.
Where Pith is reading between the lines
- The same confidence scoring could be tested on recurrent or transformer architectures where gradient noise patterns differ.
- Sparsity might lower peak memory during back-propagation for very large models.
- The approach could be combined with quantization to reduce communication volume in distributed training.
Load-bearing premise
A confidence function can be defined that consistently separates useful gradient components from noise across different models and datasets.
What would settle it
If signADAM++ fails to produce measurably sparser gradients or fails to improve final accuracy and wall-clock time over Adam on standard image-classification benchmarks, the performance advantage collapses.
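The sparsity half of this test can be checked directly by logging the fraction of zero entries in each update. The helper below is a minimal sketch; the name `update_sparsity` and the tolerance parameter `tol` are illustrative assumptions and do not come from the paper or its released code.

```python
import numpy as np


def update_sparsity(update, tol=0.0):
    """Return the fraction of entries whose magnitude is at most `tol`."""
    update = np.asarray(update)
    if update.size == 0:
        # An empty update has no entries to count; report zero sparsity.
        return 0.0
    return float(np.mean(np.abs(update) <= tol))
```

Logging this value per step for both optimizers, on the same model and data order, would show whether the masked variant actually produces sparser updates than the baseline.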
Original abstract
In this paper, we propose a new first-order gradient-based algorithm to train deep neural networks. We first introduce the sign operation of stochastic gradients (as in sign-based methods, e.g., SIGN-SGD) into ADAM, which we call signADAM. Moreover, in order to make the rate of fitting each feature closer, we define a confidence function to distinguish different components of gradients and apply it to our algorithm. It can generate sparser gradients than existing algorithms do. We call this new algorithm signADAM++. In particular, both our algorithms are easy to implement and can speed up training of various deep neural networks. The motivation of signADAM++ is to preferentially learn features from the most different samples by updating large and useful gradients regardless of useless information in stochastic gradients. We also establish theoretical convergence guarantees for our algorithms. Empirical results on various datasets and models show that our algorithms yield much better performance than many state-of-the-art algorithms including SIGN-SGD, SIGNUM and ADAM. We also analyze the performance from multiple perspectives including the loss landscape and develop an adaptive method to further improve generalization. The source code is available at https://github.com/DongWanginxdu/signADAM-Learn-by-Confidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes signADAM, which augments ADAM with the sign operation on stochastic gradients, and signADAM++, which further applies a learned confidence function to produce sparser gradient updates. It claims theoretical convergence guarantees for both algorithms and reports superior empirical performance over baselines including SIGN-SGD, SIGNUM, and ADAM across multiple datasets and models, along with an analysis of the loss landscape and an adaptive variant for generalization. Source code is released.
Significance. If the empirical gains and convergence results are reproducible, the work offers a practical first-order optimizer that exploits gradient sparsity to accelerate training while maintaining or improving accuracy. The public release of source code is a clear strength that enables direct verification and extension.
major comments (2)
- [§3.2] (definition of the confidence function): the claim that the function reliably separates useful gradient components from noise is load-bearing for both the sparsity benefit and the performance improvement, yet the precise functional form, any learned parameters, and the conditions under which it generalizes across models are not stated with sufficient precision to allow independent verification of the motivating assumption.
- [§5] (experimental protocol): the reported superiority is presented without ablation isolating the contribution of the confidence function versus the sign operation alone, and without reporting variance across random seeds or hyperparameter sensitivity; this weakens the attribution of gains specifically to signADAM++.
minor comments (2)
- [§4] The convergence theorem statement would benefit from an explicit listing of all assumptions (e.g., on bounded variance, smoothness, and properties of the confidence function) in one place.
- Figure captions and axis labels in the loss-landscape visualizations are occasionally underspecified, making it hard to interpret the quantitative comparison.
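For orientation, analyses of sign-based methods commonly rely on conditions along the following lines. This is a sketch of typical assumptions only: whether the paper's theorem uses exactly these is not stated here, and the symbols $f^*$, $L_i$, $\sigma_i$ and $\tilde g$ are assumed notation rather than the paper's.

```latex
\begin{align*}
&\text{(A1) Lower bound:}
  && f(x) \ge f^* \quad \text{for all } x, \\
&\text{(A2) Coordinate-wise smoothness:}
  && \left| f(y) - f(x) - \nabla f(x)^\top (y - x) \right|
     \le \tfrac{1}{2} \sum_i L_i \,(y_i - x_i)^2, \\
&\text{(A3) Unbiased gradients with bounded variance:}
  && \mathbb{E}\big[\tilde g(x)\big] = \nabla f(x), \quad
     \mathbb{E}\big[(\tilde g_i(x) - \nabla_i f(x))^2\big] \le \sigma_i^2 .
\end{align*}
```

A consolidated statement would also need whatever property of the confidence function the proof depends on, which cannot be inferred from the material summarized here.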
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation of minor revision. Below we respond point-by-point to the two major comments.
- Referee: [§3.2] (definition of the confidence function): the claim that the function reliably separates useful gradient components from noise is load-bearing for both the sparsity benefit and the performance improvement, yet the precise functional form, any learned parameters, and the conditions under which it generalizes across models are not stated with sufficient precision to allow independent verification of the motivating assumption.
  Authors: We agree that the current presentation of the confidence function in §3.2 lacks the required precision. In the revised manuscript we will expand this section to state the exact functional form, specify any learned parameters, and articulate the conditions under which the function is expected to generalize across models and datasets. Revision: yes.
- Referee: [§5] (experimental protocol): the reported superiority is presented without ablation isolating the contribution of the confidence function versus the sign operation alone, and without reporting variance across random seeds or hyperparameter sensitivity; this weakens the attribution of gains specifically to signADAM++.
  Authors: We accept this assessment. The existing experiments compare the full algorithms against baselines but do not isolate the contribution of the confidence function from the sign operation, nor report variance over random seeds or hyperparameter sensitivity. In the revision we will add the requested ablations together with standard deviations across multiple seeds. Revision: yes.
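The per-seed reporting requested here could be produced with a small aggregation helper such as the one below. It is a hedged sketch using only the Python standard library; the function name `summarize_runs` and the input layout (a mapping from variant name to a list of per-seed accuracies) are assumptions for illustration and are not taken from the paper's code.

```python
import statistics


def summarize_runs(accuracies_by_variant):
    """Map each variant name to (mean, sample standard deviation).

    `accuracies_by_variant` maps a variant label to a non-empty list of
    final accuracies, one per random seed.
    """
    summary = {}
    for variant, accuracies in accuracies_by_variant.items():
        mean = statistics.mean(accuracies)
        # The sample standard deviation needs at least two runs.
        std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
        summary[variant] = (mean, std)
    return summary
```

Running it over three variants (the baseline, the sign operation alone, and the sign operation plus the confidence function) would give the ablation table the referee asks for, with the spread across seeds shown next to each mean.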
Circularity Check
No significant circularity in derivation chain
The paper defines signADAM by adding the sign operation to ADAM and introduces a confidence function to produce sparser gradients in signADAM++. It states theoretical convergence guarantees separately from the empirical performance claims on multiple datasets and models. No equations, fitted parameters renamed as predictions, or self-citation chains are visible that would reduce any central result to its own inputs by construction. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- confidence function parameters
axioms (1)
- domain assumption: Standard assumptions on stochastic gradients suffice for convergence of the proposed sign-based updates.
invented entities (1)
- confidence function (no independent evidence)