On the difficulty of training Recurrent Neural Networks

Razvan Pascanu; Tomas Mikolov; Yoshua Bengio

arxiv: 1211.5063 · v2 · pith:JDUSXM5Gnew · submitted 2012-11-21 · 💻 cs.LG

classification 💻 cs.LG

keywords exploding, gradient, gradients, issues, networks, neural, problems, recurrent

There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.
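
The gradient norm clipping mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common rescaling form, in which a gradient whose L2 norm exceeds a threshold is scaled back to that threshold while its direction is kept. The function name `clip_gradient_norm`, the list-of-floats gradient, and the threshold value are illustrative assumptions, not details taken from this page.

```python
import math


def clip_gradient_norm(grads, threshold):
    """Return a copy of `grads` whose L2 norm is at most `threshold`.

    Hypothetical helper for illustration: `grads` is a flat list of floats
    standing in for the full gradient vector.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        # Rescale so the norm equals the threshold; the direction is unchanged.
        scale = threshold / norm
        return [g * scale for g in grads]
    # Gradients already within the threshold are returned as they are.
    return list(grads)


# A gradient of norm 5 is rescaled to norm 1; a small gradient is left alone.
clipped = clip_gradient_norm([3.0, 4.0], threshold=1.0)
unchanged = clip_gradient_norm([0.3, 0.4], threshold=1.0)
```

Rescaling the whole vector, rather than clipping each component separately, keeps the update pointing in the original gradient direction and limits only the step size.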

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coherent-State Propagation: A Computational Framework for Simulating Bosonic Quantum Systems
quant-ph 2026-04 unverdicted novelty 8.0

Coherent-state propagation enables quasi-polynomial classical simulation of bosonic circuits with logarithmically many Kerr gates at exponentially small trace-distance error, with polynomial runtime in the weak-nonlin...

Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States
quant-ph 2026-04 conditional novelty 7.0

Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.

Composite Bayesian Optimization In Function Spaces Using NEON -- Neural Epistemic Operator Networks
cs.LG 2024-04 unverdicted novelty 6.0

NEON provides uncertainty-aware operator learning for composite Bayesian optimization in function spaces using a single network, achieving claimed SOTA with orders of magnitude fewer parameters than ensembles.

PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

Adaptive Federated Optimization
cs.LG 2020-02 unverdicted novelty 6.0

Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
cs.CL 2016-09 accept novelty 6.0

GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human e...

Physics-informed convolutional neural networks for fluid flow through porous media
cs.LG 2026-05 unverdicted novelty 5.0

A physics-informed CNN predicts pore-scale velocity fields from geometry and serves as a warm-start to accelerate Lattice-Boltzmann solvers in over 90% of tested cases.

Inferring identified hadron production in $pp$ collisions with physics-informed machine learning at the LHC
hep-ph 2026-05 unverdicted novelty 5.0

A physics-informed neural network infers pT spectra of pi, K, p, Lambda, and Ks in unmeasured rapidity regions from PYTHIA8 pp collisions at 13.6 TeV, achieving 1.5-5.83% yield uncertainties while reproducing yield ra...

Multimodal and Multi-view Models for Emotion Recognition
cs.CL 2019-06 unverdicted novelty 5.0

Multimodal training with attention and contrastive multi-view learning improves both combined and acoustic-only emotion recognition on IEMOCAP over prior acoustic baselines.

A Wasserstein GAN-based climate scenario generator for risk management and insurance: the case of soil subsidence
cs.LG 2026-04 unverdicted novelty 4.0

A conditional Wasserstein GAN generates plausible future SWI drought trajectories for French insurance risk management under climate change.

Preventing overfitting in deep learning using differential privacy
cs.LG 2026-03 unverdicted novelty 4.0

Differential privacy techniques can help prevent overfitting and improve generalization in deep neural networks.

Autoencoding sensory substitution
q-bio.NC 2019-07 unverdicted novelty 4.0

Deep recurrent autoencoders convert images to shortened audio signals that incorporate hearing models, enabling above-chance hand posture discrimination and object reaching after a few hours of training instead of months.

On Inductive Biases in Deep Reinforcement Learning
cs.LG 2019-07 unverdicted novelty 4.0

Adaptive replacements for domain-specific components in deep RL agents can yield better learning on new tasks without additional tuning.