ADADELTA: An Adaptive Learning Rate Method
Abstract
We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
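The per-dimension rule the abstract describes can be sketched as follows. This is a minimal sketch of ADADELTA's two running accumulators (of squared gradients and squared updates), with illustrative names; it is not the authors' reference implementation:

```python
import numpy as np

def adadelta_update(grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA step for a single parameter (sketch).

    `state` holds two running averages: E[g^2] and E[dx^2].
    No learning rate is tuned; the step size is set per dimension
    by the ratio of the two accumulators.
    """
    Eg2, Edx2 = state
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2              # accumulate squared gradient
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad  # per-dimension step
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2              # accumulate squared update
    return dx, (Eg2, Edx2)
```

Note that the update `dx` has the same units as the parameter (via the `E[dx^2]` term in the numerator), which is the paper's motivation for dropping the global learning rate.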
Forward citations
Cited by 13 Pith papers
-
Neural Machine Translation of Rare Words with Subword Units
Subword segmentation via byte pair encoding enables open-vocabulary neural machine translation and improves BLEU scores by 1.1 on English-German and 1.3 on English-Russian WMT 2015 tasks over dictionary back-off baselines.
-
Neural Machine Translation by Jointly Learning to Align and Translate
An attention-based encoder-decoder model achieves English-to-French translation performance comparable to phrase-based systems by automatically learning soft alignments.
-
Adam: A Method for Stochastic Optimization
A first-order stochastic optimizer that maintains bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter to set per-parameter step sizes.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
When Descent Is Too Stable: Event-Triggered Hamiltonian Learning to Optimize
SHAPE lifts gradient descent to an augmented phase space with a learned Hamiltonian vector field and event-triggered port updates to balance descent, exploitation, and exploration, improving best-so-far performance ov...
-
Universal Adaptive Proximal Gradient Methods via Gradient Mapping Accumulation
A universal adaptive proximal gradient method converges at rates matching standard proximal gradient methods up to logarithmic factors for three problem classes without requiring knowledge of problem parameters.
-
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
RNN Encoder-Decoder learns semantically meaningful phrase representations whose conditional probabilities improve statistical machine translation when added to log-linear models.
-
VISTA: Decentralized Machine Learning in Adversary Dominated Environments
VISTA adaptively tunes consistency thresholds in decentralized SGD so that the system converges asymptotically like standard SGD even when adversaries dominate the worker pool.
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
-
SGDR: Stochastic Gradient Descent with Warm Restarts
SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.
-
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
-
NeuroPlastic: A Plasticity-Modulated Optimizer for Biologically Inspired Learning Dynamics
NeuroPlastic is a gradient-based optimizer augmented with a multi-signal plasticity modulation mechanism that improves performance over standard updates on image classification tasks, especially in low-data regimes.
-
Harmonizing MR Images Across 100+ Scanners: Multi-site Validation with Traveling Subjects and Real-world Protocols
HACA3+ improves upon HACA3 with better artifact encoding, attention mechanisms, and training on 100+ scanners, validated via traveling subjects for better downstream performance.
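Among the citing papers above, Adam's update rule is close enough to ADADELTA's to sketch for comparison: it keeps bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter. A minimal sketch with illustrative names, not the paper's reference code:

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step for a single parameter (sketch); t starts at 1."""
    m = b1 * m + (1 - b1) * grad          # EMA of the gradient
    v = b2 * v + (1 - b2) * grad ** 2     # EMA of the squared gradient
    m_hat = m / (1 - b1 ** t)             # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

Unlike ADADELTA, Adam retains a global learning rate `lr`, but both scale each dimension by a root-mean-square of recent gradients.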