signADAM: Learning Confidences for Deep Neural Networks

Dong Wang; Fanhua Shang; Hongying Liu; Licheng Jiao; Qigong Sun; Wenwo Tang; Yicheng Liu

arxiv: 1907.09008 · v1 · pith:F3GDN43Tnew · submitted 2019-07-21 · 💻 cs.CV · cs.LG · math.OC · stat.ML

Pith reviewed 2026-05-24 18:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · math.OC · stat.ML

keywords signADAM · confidence function · sparse gradients · Adam optimizer · sign-based methods · deep neural network training · convergence analysis · gradient sparsity

The pith

signADAM adds the sign of stochastic gradients to Adam and uses a confidence function to sparsify updates for faster neural network training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes signADAM by replacing the magnitude of gradients in Adam with their signs while retaining adaptive moment estimates. It then introduces signADAM++ through a confidence function that scores gradient components to emphasize large useful signals and suppress noise. This produces sparser updates intended to focus learning on the most informative samples. Convergence guarantees are derived, and experiments across datasets and models report gains over Adam, Sign-SGD and Signum in both speed and accuracy.
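To make the first step concrete, here is a minimal pure-Python sketch of a sign-based momentum update. It assumes the moment estimate is an exponential moving average of gradient signs with Adam-style bias correction. That reading is consistent with the bound on the moment estimate in the Lemma 1 fragment, but this page does not confirm it as the authors' exact algorithm. The names `signadam_step`, `lr`, and `beta` are hypothetical.

```python
def sign(x):
    """Return -1.0, 0.0 or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def signadam_step(params, grads, m, step, lr=0.01, beta=0.9):
    """One sign-momentum update (illustrative sketch, not the authors' code).

    params, grads, m: lists of floats of equal length.
    step: 1-based iteration counter used for bias correction.
    """
    correction = 1.0 - beta ** step  # Adam-style bias correction (assumed)
    new_params, new_m = [], []
    for p, g, m_i in zip(params, grads, m):
        # Moving average of gradient signs, so |m_i| never exceeds 1.
        m_i = beta * m_i + (1.0 - beta) * sign(g)
        new_m.append(m_i)
        new_params.append(p - lr * m_i / correction)
    return new_params, new_m


# Toy usage: minimise f(x) = sum(x_i^2), whose gradient is 2 * x.
params, m = [1.0, -2.0, 0.0], [0.0, 0.0, 0.0]
for step in range(1, 51):
    grads = [2.0 * p for p in params]
    params, m = signadam_step(params, grads, m, step)
print(params)
```

Because only the sign enters the moving average, every coordinate moves by at most `lr` per step regardless of the raw gradient scale, which is the property sign-based methods trade on.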

Core claim

The central claim is that inserting the sign operation into Adam, then modulating updates with a learned confidence that separates useful from noisy gradient entries, yields both theoretical convergence and empirical improvements in training deep networks by generating sparser, more directed steps.

What carries the argument

The confidence function that scores individual gradient components to generate sparsity while preserving the sign and adaptive-rate structure of Adam.
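A small sketch may help show how such a scoring rule produces sparsity. The exact confidence function is not given on this page, so the rule below is a stand-in: it keeps the sign of a component only if its magnitude reaches `alpha` times the largest magnitude in the vector, and zeroes everything else. The threshold rule and the names `confidence_sparsify`, `remaining_ratio`, and `alpha` are hypothetical, chosen only to mirror the "confidence factor" and "remaining ratio" mentioned in the Figure 3 caption.

```python
def sign(x):
    """Return -1.0, 0.0 or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def confidence_sparsify(grads, alpha):
    """Keep signs of large components, zero the rest.

    Hypothetical rule: threshold = alpha * max(|g_i|), with alpha in [0, 1].
    """
    if not grads:
        return []
    threshold = alpha * max(abs(g) for g in grads)
    return [sign(g) if abs(g) >= threshold else 0.0 for g in grads]


def remaining_ratio(update):
    """Fraction of components that are non-zero after sparsification."""
    if not update:
        return 0.0
    return sum(1 for u in update if u != 0.0) / len(update)


grads = [0.9, -0.05, 0.3, -0.7, 0.01, 0.0]
for alpha in (0.0, 0.25, 0.5, 1.0):
    update = confidence_sparsify(grads, alpha)
    print(alpha, update, remaining_ratio(update))
```

Under this stand-in rule, raising `alpha` can only shrink the set of surviving components, so the remaining ratio is non-increasing in the confidence factor.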

If this is right

The algorithms converge at rates comparable to or better than Adam under standard smoothness assumptions.
Sparser gradients reduce the impact of noisy components and improve final test performance on vision tasks.
An adaptive extension based on loss-landscape analysis further boosts generalization.
Both variants remain first-order and require only minor code changes from existing Adam implementations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same confidence scoring could be tested on recurrent or transformer architectures where gradient noise patterns differ.
Sparsity might lower peak memory during back-propagation for very large models.
The approach could be combined with quantization to reduce communication volume in distributed training.

Load-bearing premise

A confidence function can be defined that consistently separates useful gradient components from noise across different models and datasets.

What would settle it

If signADAM++ fails to produce measurably sparser gradients or fails to improve final accuracy and wall-clock time over Adam on standard image-classification benchmarks, the performance advantage collapses.

Figures

Figures reproduced from arXiv: 1907.09008 by Dong Wang, Fanhua Shang, Hongying Liu, Licheng Jiao, Qigong Sun, Wenwo Tang, Yicheng Liu.

**Figure 2.** Raw gradients have two parts coming from correct and incorrect samples, respectively. After being processed by the calculation unit, the…

**Figure 3.** Remaining ratio changes as the confidence factor increases.

**Figure 4.** Comparison of the training loss and test error of all the algo…

**Figure 5.** The sparsity of the confidence gradients. These parameters are sampled from a randomly chosen layer of VGG-19 on the CIFAR-10 dataset. We find that the gradients are indeed sparse when applying our confidence function into the unprocessed gradients. It fits the biological neural process as mentioned in Section 3. By observing the training loss on CIFAR-10, our signADAM++ enjoys faster convergence tha…

**Figure 6.** Comparison of the logarithm training loss and test error of all the algorithms on MNIST.

**Figure 7.** Comparison of the logarithm training loss and test error of all the algorithms on CIFAR-10.

Original abstract

In this paper, we propose a new first-order gradient-based algorithm to train deep neural networks. We first introduce the sign operation of stochastic gradients (as in sign-based methods, e.g., SIGN-SGD) into ADAM, which is called as signADAM. Moreover, in order to make the rate of fitting each feature closer, we define a confidence function to distinguish different components of gradients and apply it to our algorithm. It can generate more sparse gradients than existing algorithms do. We call this new algorithm signADAM++. In particular, both our algorithms are easy to implement and can speed up training of various deep neural networks. The motivation of signADAM++ is preferably learning features from the most different samples by updating large and useful gradients regardless of useless information in stochastic gradients. We also establish theoretical convergence guarantees for our algorithms. Empirical results on various datasets and models show that our algorithms yield much better performance than many state-of-the-art algorithms including SIGN-SGD, SIGNUM and ADAM. We also analyze the performance from multiple perspectives including the loss landscape and develop an adaptive method to further improve generalization. The source code is available at https://github.com/DongWanginxdu/signADAM-Learn-by-Confidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

signADAM adds signed gradients to ADAM and signADAM++ layers on a confidence function for sparsity, with experiments claiming gains over ADAM and sign methods plus convergence theory.

signADAM is basically ADAM but with the sign of the gradient instead of the raw value, and signADAM++ adds a confidence function that tries to zero out less useful components for sparser updates. The abstract and claims point to faster training and better final performance on standard datasets and models, plus some convergence analysis. They also mention loss landscape views and an extra adaptive tweak for generalization, and the code is on GitHub.

That combination of signed gradients with ADAM plus the sparsity trick is the concrete new piece, and running the same experiments across multiple models is straightforward and useful. Releasing the implementation is a clear positive for anyone who wants to test it directly.

The soft spot is the confidence function itself. It is presented as the key to distinguishing signal from noise, but without seeing the exact formula, how its parameters are set, and whether they stay fixed across datasets, it is hard to judge if the sparsity is doing real work or just acting like extra regularization that could be matched by other means. The empirical claims are stated as much better performance, yet the abstract does not show ablations that isolate the confidence step from the sign change or from hyperparameter differences. The theory is listed separately, so the assumptions there will need checking to see if they match typical deep-net training.

This is a paper for people who work on first-order optimizers for vision models. It is incremental rather than foundational, but the experiments, code, and stated theory give it enough substance that a serious editor should send it to referees instead of desk-rejecting it.

Referee Report

2 major / 2 minor

Summary. The paper proposes signADAM, which augments ADAM with the sign operation on stochastic gradients, and signADAM++, which further applies a learned confidence function to produce sparser gradient updates. It claims theoretical convergence guarantees for both algorithms and reports superior empirical performance over baselines including SIGN-SGD, SIGNUM, and ADAM across multiple datasets and models, along with an analysis of the loss landscape and an adaptive variant for generalization. Source code is released.

Significance. If the empirical gains and convergence results are reproducible, the work offers a practical first-order optimizer that exploits gradient sparsity to accelerate training while maintaining or improving accuracy. The public release of source code is a clear strength that enables direct verification and extension.

major comments (2)

[§3.2] Definition of the confidence function: the claim that the function reliably separates useful gradient components from noise is load-bearing for both the sparsity benefit and the performance improvement, yet the precise functional form, any learned parameters, and the conditions under which it generalizes across models are not stated with sufficient precision to allow independent verification of the motivating assumption.
[§5] Experimental protocol: the reported superiority is presented without ablation isolating the contribution of the confidence function versus the sign operation alone, and without reporting variance across random seeds or hyperparameter sensitivity; this weakens the attribution of gains specifically to signADAM++.
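The kind of reporting requested in the second comment can be sketched on a toy problem. The example below is not the paper's experimental setup: it runs plain gradient descent on a noisy quadratic with and without the sign operation, repeats each variant over several seeds, and reports the mean and sample standard deviation of the final loss. The function names, the objective, and all hyperparameter values are hypothetical.

```python
import random
import statistics


def noisy_quadratic_run(seed, use_sign, steps=200, lr=0.01, noise=0.5):
    """Final loss of one seeded run on f(x) = sum(x_i^2) with noisy gradients."""
    rng = random.Random(seed)  # per-run generator keeps runs reproducible
    x = [1.0, -1.0, 0.5]
    for _ in range(steps):
        grads = [2.0 * xi + rng.gauss(0.0, noise) for xi in x]
        if use_sign:
            # Ablation switch: replace each gradient entry by its sign.
            grads = [float((g > 0) - (g < 0)) for g in grads]
        x = [xi - lr * g for xi, g in zip(x, grads)]
    return sum(xi * xi for xi in x)


def summarize(use_sign, seeds=range(5)):
    """Mean and sample standard deviation of the final loss across seeds."""
    losses = [noisy_quadratic_run(s, use_sign) for s in seeds]
    return statistics.mean(losses), statistics.stdev(losses)


for name, flag in (("raw gradient", False), ("sign only", True)):
    mean, std = summarize(flag)
    print(f"{name}: final loss {mean:.4f} +/- {std:.4f} over 5 seeds")
```

Toggling one component at a time while holding seeds and hyperparameters fixed is what lets a gain be attributed to that component; the same pattern would extend to a third variant with a confidence step.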

minor comments (2)

[§4] The convergence theorem statement would benefit from an explicit listing of all assumptions (e.g., on bounded variance, smoothness, and properties of the confidence function) in one place.
Figure captions and axis labels in the loss-landscape visualizations are occasionally underspecified, making it hard to interpret the quantitative comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. Below we respond point-by-point to the two major comments.

Referee: [§3.2] Definition of the confidence function: the claim that the function reliably separates useful gradient components from noise is load-bearing for both the sparsity benefit and the performance improvement, yet the precise functional form, any learned parameters, and the conditions under which it generalizes across models are not stated with sufficient precision to allow independent verification of the motivating assumption.

Authors: We agree that the current presentation of the confidence function in §3.2 lacks the required precision. In the revised manuscript we will expand this section to state the exact functional form, specify any learned parameters, and articulate the conditions under which the function is expected to generalize across models and datasets.

revision: yes

Referee: [§5] Experimental protocol: the reported superiority is presented without ablation isolating the contribution of the confidence function versus the sign operation alone, and without reporting variance across random seeds or hyperparameter sensitivity; this weakens the attribution of gains specifically to signADAM++.

Authors: We accept this assessment. The existing experiments compare the full algorithms against baselines but do not isolate the contribution of the confidence function from the sign operation, nor report variance over random seeds or hyper-parameter sensitivity. In the revision we will add the requested ablations together with standard deviations across multiple seeds.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines signADAM by adding the sign operation to ADAM and introduces a confidence function to produce sparser gradients in signADAM++. It states theoretical convergence guarantees separately from the empirical performance claims on multiple datasets and models. No equations, fitted parameters renamed as predictions, or self-citation chains are visible that would reduce any central result to its own inputs by construction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Based solely on the abstract; the confidence function and its parameters are introduced without upstream justification, and convergence is asserted under unspecified standard assumptions.

free parameters (1)

confidence function parameters
The abstract states that a confidence function is defined to distinguish gradient components, implying tunable or learned parameters whose values are not specified.

axioms (1)

domain assumption: Standard assumptions on stochastic gradients suffice for convergence of the proposed sign-based updates.
The abstract claims theoretical convergence guarantees without detailing the assumptions.

invented entities (1)

confidence function (no independent evidence)
purpose: To score gradient components and generate sparser updates that focus on the most different samples.
Introduced in the abstract as the key addition for signADAM++.

