Recurrent Neural Network Regularization

Ilya Sutskever; Oriol Vinyals; Wojciech Zaremba

arxiv: 1409.2329 · v5 · pith:4QNKC4O7new · submitted 2014-09-08 · 💻 cs.NE

Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals

classification: cs.NE

keywords: neural, dropout, lstms, networks, recurrent, regularization, rnns, tasks

Abstract

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Dropout, the most successful technique for regularizing neural networks, does not work well with RNNs and LSTMs. In this paper, we show how to correctly apply dropout to LSTMs, and show that it substantially reduces overfitting on a variety of tasks. These tasks include language modeling, speech recognition, image caption generation, and machine translation.
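The abstract does not spell out the recipe. In the full paper, the key idea is to apply dropout only to the non-recurrent connections of the LSTM (the input arriving from the layer below and the output passed upward), leaving the hidden and cell states that flow from one time step to the next untouched. The following is a minimal pure-Python sketch of that idea, not the authors' implementation: the names `LSTMCell`, `run_sequence`, and `dropout`, the gate ordering, the weight initialization, and the use of inverted (rescaled) dropout are all illustrative assumptions.

```python
import math
import random


def dropout(vec, rate, rng, train=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not train or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class LSTMCell:
    """A tiny LSTM cell with random weights (hypothetical, for illustration only)."""

    def __init__(self, n_in, n_hid, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.n_hid = n_hid
        # Stacked weights producing the four gate pre-activations (i, f, o, g).
        self.W_x = mat(4 * n_hid, n_in)   # non-recurrent (input) connection
        self.W_h = mat(4 * n_hid, n_hid)  # recurrent connection
        self.b = [0.0] * (4 * n_hid)

    def step(self, x, h_prev, c_prev):
        z = [a + r + b for a, r, b in
             zip(matvec(self.W_x, x), matvec(self.W_h, h_prev), self.b)]
        n = self.n_hid
        i = [sigmoid(v) for v in z[0:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        o = [sigmoid(v) for v in z[2 * n:3 * n]]
        g = [math.tanh(v) for v in z[3 * n:4 * n]]
        c = [fi * cp + ii * gi for fi, cp, ii, gi in zip(f, c_prev, i, g)]
        h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
        return h, c


def run_sequence(cell, inputs, rate, rng, train=True):
    """Run the cell over a sequence, dropping only non-recurrent connections."""
    h = [0.0] * cell.n_hid
    c = [0.0] * cell.n_hid
    outputs = []
    for x in inputs:
        # Dropout on the input from below; h and c pass through time unmasked.
        h, c = cell.step(dropout(x, rate, rng, train), h, c)
        # Dropout on the copy of h handed upward (next layer or output layer).
        outputs.append(dropout(h, rate, rng, train))
    return outputs, h, c
```

Because the recurrent state is never masked, information stored in the cell can persist across many time steps, while the noise injected on the vertical connections still regularizes the network. Calling `run_sequence(cell, xs, 0.5, random.Random(1))` on a short list of input vectors returns the masked per-step outputs together with the unmasked final `h` and `c`.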

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
cs.LG 2017-01 accept novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...

SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition
cs.HC 2026-05 unverdicted novelty 7.0

SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.

Augmenting Self-attention with Persistent Memory
cs.LG 2019-07 unverdicted novelty 7.0

Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.

Pointer Sentinel Mixture Models
cs.CL 2016-09 conditional novelty 7.0

Pointer sentinel-LSTM mixes context copying with softmax prediction to reach 70.9 perplexity on Penn Treebank using fewer parameters than standard LSTMs.

Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

MarsTSC is a VLM agentic system with generator, reflector, and modifier roles that iteratively refines a knowledge bank to improve few-shot multimodal time series classification and produce human-readable explanations.

Adversarial Learning for Improved Onsets and Frames Music Transcription
cs.SD 2019-06 unverdicted novelty 6.0

Adversarial training on time-frequency representations yields consistent gains in frame-level and note-level accuracy over the Onsets and Frames baseline for automatic music transcription.

Online Supervised Learning for Traffic Load Prediction in Framed-ALOHA Networks
cs.NI 2019-07 unverdicted novelty 5.0

LSTM online predictor with MOM-based labeling estimates backlog in framed-ALOHA networks and adapts to changing statistics without prior traffic model knowledge.

Wind Estimation Using Quadcopter Motion: A Machine Learning Approach
eess.SP 2019-07 unverdicted novelty 5.0

An LSTM neural network trained on simulated quadcopter states estimates turbulent wind velocities with lower mean and variance errors than a tilt-angle wind triangle method.

Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR
eess.AS 2019-07 unverdicted novelty 4.0

KLD-based speaker adaptation of seq2seq ASR achieves 25% relative WER reduction, outperforming the 18.7% gain from conventional acoustic model adaptation.