Equilibrated adaptive learning rates for non-convex optimization

Yann N. Dauphin , Harm de Vries , Yoshua Bengio

classification cs.LG cs.NA

keywords adaptive, learning, better, preconditioner, rate, encountered, equilibration, esgd

Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we examine how accounting for the negative eigenvalues of the Hessian could help us design better-suited adaptive learning rate schemes. We show that the popular Jacobi preconditioner has undesirable behavior in the presence of both positive and negative curvature, and present theoretical and empirical evidence that the so-called equilibration preconditioner is comparatively better suited to non-convex problems. We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well as or better than RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.
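
The contrast between the two preconditioners can be illustrated with a small sketch. It is not taken from this page: it assumes the usual definitions, with the Jacobi preconditioner as the absolute diagonal of the Hessian and the equilibration preconditioner as the row norms of the Hessian, estimated from Hessian-vector products with Gaussian vectors. The toy saddle f(x, y) = x * y and all function names are hypothetical choices for illustration, and the explicit Hessian stands in for the Hessian-vector products that a real network would need.

```python
import math
import random


def hessian_vector_product(hessian, v):
    """Multiply an explicit Hessian (list of rows) by a vector."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in hessian]


def jacobi_preconditioner(hessian):
    """Assumed definition: absolute value of the Hessian diagonal."""
    return [abs(hessian[i][i]) for i in range(len(hessian))]


def equilibration_estimate(hessian, num_samples=20000, seed=0):
    """Monte Carlo estimate of sqrt(E[(Hv)^2]) with v ~ N(0, I).

    For Gaussian v the expectation of (Hv)_i^2 equals the squared
    norm of row i, so this approximates the row norms of H.
    """
    rng = random.Random(seed)
    n = len(hessian)
    acc = [0.0] * n
    for _ in range(num_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        hv = hessian_vector_product(hessian, v)
        for i in range(n):
            acc[i] += hv[i] ** 2
    return [math.sqrt(a / num_samples) for a in acc]


def row_norms(hessian):
    """Exact row norms, used only to check the estimate."""
    return [math.sqrt(sum(x * x for x in row)) for row in hessian]


# Hypothetical toy saddle f(x, y) = x * y: eigenvalues +1 and -1,
# but a zero diagonal, so the Jacobi preconditioner carries no scale.
saddle_hessian = [[0.0, 1.0], [1.0, 0.0]]

jacobi = jacobi_preconditioner(saddle_hessian)
equilibration = equilibration_estimate(saddle_hessian)
exact = row_norms(saddle_hessian)

print("Jacobi:       ", jacobi)
print("Equilibration:", equilibration)
print("Row norms:    ", exact)
```

On this toy problem the Jacobi entries are exactly zero, so dividing the gradient by them would be ill-defined without damping, while the equilibration estimate stays close to the row norms. An ESGD-style update would then divide each gradient coordinate by a running version of this estimate plus a small damping constant; those details are assumptions here and should be checked against the paper.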

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Session-based Recommendations with Recurrent Neural Networks
cs.LG 2015-11 conditional novelty 8.0

RNNs with ranking loss outperform item-to-item baselines for session-based recommendations on two datasets.

SGDR: Stochastic Gradient Descent with Warm Restarts
cs.LG 2016-08 accept novelty 6.0

SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.