
arxiv: 1907.09008 · v1 · pith:F3GDN43Tnew · submitted 2019-07-21 · 💻 cs.CV · cs.LG · math.OC · stat.ML

signADAM: Learning Confidences for Deep Neural Networks

Pith reviewed 2026-05-24 18:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · math.OC · stat.ML
keywords signADAM · confidence function · sparse gradients · Adam optimizer · sign-based methods · deep neural network training · convergence analysis · gradient sparsity

The pith

signADAM introduces the sign of stochastic gradients into Adam; signADAM++ adds a confidence function that sparsifies the updates for faster neural network training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes signADAM by replacing the magnitude of gradients in Adam with their signs while retaining adaptive moment estimates. It then introduces signADAM++ through a confidence function that scores gradient components to emphasize large useful signals and suppress noise. This produces sparser updates intended to focus learning on the most informative samples. Convergence guarantees are derived, and experiments across datasets and models report gains over Adam, Sign-SGD and Signum in both speed and accuracy.
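
As a reading aid, the sign-plus-momentum idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the sign is applied to each gradient component before the first-moment average, it omits any second-moment scaling, and the names `sign_momentum_step`, `lr` and `beta` are hypothetical.

```python
from typing import List, Tuple


def sign(v: float) -> float:
    """Return -1.0, 0.0 or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def sign_momentum_step(
    params: List[float],
    grads: List[float],
    momentum: List[float],
    lr: float = 0.001,
    beta: float = 0.9,
) -> Tuple[List[float], List[float]]:
    """One sign-based momentum step (assumed form, for illustration only).

    Each gradient component is reduced to its sign, averaged into an
    exponential moving average, and the parameters move against it.
    """
    # Moving average of gradient signs rather than raw gradients.
    new_momentum = [
        beta * m + (1.0 - beta) * sign(g) for m, g in zip(momentum, grads)
    ]
    # Plain descent step along the averaged sign direction.
    new_params = [p - lr * m for p, m in zip(params, new_momentum)]
    return new_params, new_momentum
```

Because each momentum entry in this sketch is an average of values in {-1, 0, 1}, its magnitude stays at or below 1 when it starts at zero, so the step size is governed by `lr` rather than by the raw gradient scale.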

Core claim

The central claim is that inserting the sign operation into Adam, then modulating updates with a learned confidence that separates useful from noisy gradient entries, yields both theoretical convergence and empirical improvements in training deep networks by generating sparser, more directed steps.

What carries the argument

The confidence function that scores individual gradient components to generate sparsity while preserving the sign and adaptive-rate structure of Adam.
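
The page does not give the functional form of the confidence function, so the following is only a hypothetical stand-in showing how component-wise scoring can produce sparse sign updates. It assumes a single confidence factor `alpha` in [0, 1] and keeps a component only when its magnitude is at least `alpha` times the largest magnitude in the vector; both the rule and the name `confidence_sign` are illustrative assumptions, not the paper's definition.

```python
from typing import List


def confidence_sign(grads: List[float], alpha: float) -> List[float]:
    """Hypothetical confidence gate followed by the sign operation.

    Components whose magnitude falls below alpha * max|g| are zeroed;
    the rest are replaced by their sign. This rule is an assumption.
    """
    if not grads:
        return []
    # Threshold relative to the largest component in this vector.
    threshold = alpha * max(abs(g) for g in grads)
    gated = []
    for g in grads:
        if g != 0.0 and abs(g) >= threshold:
            gated.append(1.0 if g > 0 else -1.0)
        else:
            gated.append(0.0)  # suppressed as low-confidence
    return gated
```

Raising `alpha` in this sketch can only shrink or preserve the set of surviving components, which is the qualitative behaviour one would expect from a confidence threshold.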

If this is right

  • The algorithms converge at rates comparable to or better than Adam under standard smoothness assumptions.
  • Sparser gradients reduce the impact of noisy components and improve final test performance on vision tasks.
  • An adaptive extension based on loss-landscape analysis further boosts generalization.
  • Both variants remain first-order and require only minor code changes from existing Adam implementations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same confidence scoring could be tested on recurrent or transformer architectures where gradient noise patterns differ.
  • Sparsity might lower peak memory during back-propagation for very large models.
  • The approach could be combined with quantization to reduce communication volume in distributed training.

Load-bearing premise

A confidence function can be defined that consistently separates useful gradient components from noise across different models and datasets.

What would settle it

If signADAM++ fails to produce measurably sparser gradients or fails to improve final accuracy and wall-clock time over Adam on standard image-classification benchmarks, the performance advantage collapses.
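
One way to check the sparsity half of this test is to log the fraction of nonzero entries in each update. The helper below is a minimal sketch; `remaining_ratio` and `sparsity` are hypothetical names, and an exact zero is assumed to mark a suppressed component.

```python
from typing import Sequence


def remaining_ratio(update: Sequence[float]) -> float:
    """Fraction of components in an update vector that are nonzero."""
    if len(update) == 0:
        return 0.0
    nonzero = sum(1 for u in update if u != 0.0)
    return nonzero / len(update)


def sparsity(update: Sequence[float]) -> float:
    """Fraction of components that are exactly zero (1 - remaining ratio)."""
    if len(update) == 0:
        return 0.0
    return 1.0 - remaining_ratio(update)
```

Logging this value per layer and per step for each optimizer would show whether the confidence-gated updates are measurably sparser than the ungated ones.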

Figures

Figures reproduced from arXiv: 1907.09008 by Dong Wang, Fanhua Shang, Hongying Liu, Licheng Jiao, Qigong Sun, Wenwo Tang, Yicheng Liu.

Figure 1: The connection between the loss landscape and learning. Loss…
Figure 2: Raw gradients have two parts coming from correct and incorrect samples, respectively. After being processed by the calculation unit, the…
Figure 3: Remaining ratio changes as the confidence factor increases.
Figure 4: Comparison of the training loss and test error of all the algo…
Figure 5 shows the sparsity of the confidence gradients. These parameters are sampled from a randomly chosen layer of VGG-19 on the CIFAR-10 dataset. We find that the gradients are indeed sparse when applying our confidence function into the unprocessed gradients. It fits the biological neural process as mentioned in Section 3. By observing the training loss on CIFAR-10, our signADAM++ enjoys faster convergence tha…
Figure 6: Comparison of the logarithm training loss and test error of all the algorithms on MNIST.
Figure 7: Comparison of the logarithm training loss and test error of all the algorithms on CIFAR-10.
Original abstract

In this paper, we propose a new first-order gradient-based algorithm to train deep neural networks. We first introduce the sign operation of stochastic gradients (as in sign-based methods, e.g., SIGN-SGD) into ADAM, which is called as signADAM. Moreover, in order to make the rate of fitting each feature closer, we define a confidence function to distinguish different components of gradients and apply it to our algorithm. It can generate more sparse gradients than existing algorithms do. We call this new algorithm signADAM++. In particular, both our algorithms are easy to implement and can speed up training of various deep neural networks. The motivation of signADAM++ is preferably learning features from the most different samples by updating large and useful gradients regardless of useless information in stochastic gradients. We also establish theoretical convergence guarantees for our algorithms. Empirical results on various datasets and models show that our algorithms yield much better performance than many state-of-the-art algorithms including SIGN-SGD, SIGNUM and ADAM. We also analyze the performance from multiple perspectives including the loss landscape and develop an adaptive method to further improve generalization. The source code is available at https://github.com/DongWanginxdu/signADAM-Learn-by-Confidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes signADAM, which augments ADAM with the sign operation on stochastic gradients, and signADAM++, which further applies a learned confidence function to produce sparser gradient updates. It claims theoretical convergence guarantees for both algorithms and reports superior empirical performance over baselines including SIGN-SGD, SIGNUM, and ADAM across multiple datasets and models, along with an analysis of the loss landscape and an adaptive variant for generalization. Source code is released.

Significance. If the empirical gains and convergence results are reproducible, the work offers a practical first-order optimizer that exploits gradient sparsity to accelerate training while maintaining or improving accuracy. The public release of source code is a clear strength that enables direct verification and extension.

major comments (2)
  1. [§3.2] §3.2 (definition of the confidence function): the claim that the function reliably separates useful gradient components from noise is load-bearing for both the sparsity benefit and the performance improvement, yet the precise functional form, any learned parameters, and the conditions under which it generalizes across models are not stated with sufficient precision to allow independent verification of the motivating assumption.
  2. [§5] §5 (experimental protocol): the reported superiority is presented without ablation isolating the contribution of the confidence function versus the sign operation alone, and without reporting variance across random seeds or hyperparameter sensitivity; this weakens the attribution of gains specifically to signADAM++.
minor comments (2)
  1. [§4] The convergence theorem statement would benefit from an explicit listing of all assumptions (e.g., on bounded variance, smoothness, and properties of the confidence function) in one place.
  2. Figure captions and axis labels in the loss-landscape visualizations are occasionally underspecified, making it hard to interpret the quantitative comparison.
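
The seed-variance reporting requested in major comment 2 needs little machinery. A minimal sketch, assuming each run yields one final test accuracy and that `summarize_runs` (a hypothetical helper) receives the list of per-seed values:

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed results.

    With a single run the spread is undefined, so 0.0 is returned
    for the standard deviation in that case.
    """
    if len(accuracies) == 0:
        raise ValueError("at least one run is required")
    mean = statistics.mean(accuracies)
    # statistics.stdev is the sample (n - 1) standard deviation.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std
```

Reporting the pair for each optimizer, alongside a signADAM-only run without the confidence function, would let a reader judge whether the gap attributed to signADAM++ exceeds seed-to-seed noise.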

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. Below we respond point-by-point to the two major comments.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (definition of the confidence function): the claim that the function reliably separates useful gradient components from noise is load-bearing for both the sparsity benefit and the performance improvement, yet the precise functional form, any learned parameters, and the conditions under which it generalizes across models are not stated with sufficient precision to allow independent verification of the motivating assumption.

    Authors: We agree that the current presentation of the confidence function in §3.2 lacks the required precision. In the revised manuscript we will expand this section to state the exact functional form, specify any learned parameters, and articulate the conditions under which the function is expected to generalize across models and datasets.
    Revision: yes

  2. Referee: [§5] §5 (experimental protocol): the reported superiority is presented without ablation isolating the contribution of the confidence function versus the sign operation alone, and without reporting variance across random seeds or hyperparameter sensitivity; this weakens the attribution of gains specifically to signADAM++.

    Authors: We accept this assessment. The existing experiments compare the full algorithms against baselines but neither isolate the contribution of the confidence function from that of the sign operation nor report variance over random seeds or hyperparameter sensitivity. In the revision we will add the requested ablations together with standard deviations across multiple seeds.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper defines signADAM by adding the sign operation to ADAM and introduces a confidence function to produce sparser gradients in signADAM++. It states theoretical convergence guarantees separately from the empirical performance claims on multiple datasets and models. No equations, fitted parameters renamed as predictions, or self-citation chains are visible that would reduce any central result to its own inputs by construction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Based solely on the abstract; the confidence function and its parameters are introduced without upstream justification, and convergence is asserted under unspecified standard assumptions.

free parameters (1)
  • confidence function parameters
    The abstract states that a confidence function is defined to distinguish gradient components, implying tunable or learned parameters whose values are not specified.
axioms (1)
  • domain assumption: Standard assumptions on stochastic gradients suffice for convergence of the proposed sign-based updates.
    The abstract claims theoretical convergence guarantees without detailing the assumptions.
invented entities (1)
  • confidence function (no independent evidence)
    purpose: To score gradient components and generate sparser updates that focus on the most different samples.
    Introduced in the abstract as the key addition for signADAM++.

pith-pipeline@v0.9.0 · 5773 in / 1300 out tokens · 25307 ms · 2026-05-24T18:29:33.301989+00:00 · methodology

