Improving neural networks by preventing co-adaptation of feature detectors

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov

classification cs.NE cs.CV cs.LG

keywords feature, detectors, helpful, large, neural, training, answer, benchmark

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
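The omission step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `dropout_train` and `dropout_test`, the toy array shapes, and the fixed seed are all assumptions made for the example. The abstract does not describe test-time behaviour; the sketch assumes all units are kept at test time and their activations are scaled by the keep probability, which is equivalent to halving the outgoing weights when half of the units are dropped.

```python
import numpy as np


def dropout_train(activations, rng, drop_prob=0.5):
    """Zero each unit independently with probability drop_prob.

    A fresh mask is drawn for every row, so each training case sees a
    different random subset of feature detectors.
    """
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask


def dropout_test(activations, drop_prob=0.5):
    """Keep every unit and scale by the keep probability (assumed test rule).

    The scaling keeps the expected input to the next layer the same as
    during training.
    """
    return activations * (1.0 - drop_prob)


rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
hidden = np.ones((4, 8))         # toy data: 4 training cases, 8 hidden units

dropped = dropout_train(hidden, rng)  # entries are 0.0 or 1.0
scaled = dropout_test(hidden)         # every entry is 0.5
```

Because each unit cannot rely on any particular other unit being present, it is pushed toward features that are useful across many random subsets of its peers, which is the co-adaptation argument the abstract makes.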

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative Adversarial Networks
stat.ML 2014-06 accept novelty 9.0

A generative model is trained to match a data distribution by competing in a minimax game against a discriminator, reaching an equilibrium where the generator recovers the true distribution and the discriminator outpu...

Deep Residual Learning for Image Recognition
cs.CV 2015-12 accept novelty 8.0

Residual networks reformulate layers to learn residual functions, enabling effective training of up to 152-layer models that achieve 3.57% error on ImageNet and win ILSVRC 2015.

Conditional Generative Adversarial Nets
cs.LG 2014-11 accept novelty 8.0

Conditional GANs generate samples matching a given condition by supplying the condition to both generator and discriminator.

Adam: A Method for Stochastic Optimization
cs.LG 2014-12 accept novelty 7.5

A first-order stochastic optimizer that maintains bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter to set per-parameter step sizes.

Simultaneous measurements of $N$-subjettiness observables in jets from gluons and light-flavour quarks, and in decays of boosted W bosons and top quarks
hep-ex 2026-04 unverdicted novelty 7.0

CMS reports a simultaneous measurement of 25 N-subjettiness observables in 1-, 2-, and 3-prong jets, unfolded to stable particles with particle-level correlations for QCD modeling.

Improved Regularization of Convolutional Neural Networks with Cutout
cs.CV 2017-08 accept novelty 7.0

Randomly masking square regions of input images during CNN training yields new state-of-the-art test errors of 2.56% on CIFAR-10, 15.20% on CIFAR-100, and 1.30% on SVHN.

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
cs.LG 2013-08 conditional novelty 7.0

The paper introduces and compares gradient estimators for stochastic binary neurons, notably a decomposition approach and the straight-through estimator, to support sparse conditional computation in deep networks.

Explicit Dropout: Deterministic Regularization for Transformer Architectures
cs.LG 2026-04 unverdicted novelty 6.0

Explicit dropout reformulates stochastic dropout as deterministic loss penalties for Transformers, matching or exceeding standard performance with independent control per component.

Language models recognize dropout and Gaussian noise applied to their activations
cs.AI 2026-04 unverdicted novelty 6.0

Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.

Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles
cs.CV 2026-04 unverdicted novelty 4.0

A multi-stream ensemble using DINOv2 and CLIP backbones trained with extreme degradations achieves stable deepfake detection and fourth place in the NTIRE 2026 challenge.

Quantum memory and scrambling from the perspective of a classical neural network
quant-ph 2026-04 unverdicted novelty 4.0

Time-dependent quantum memory oscillates faster than OTOC, does not equilibrate, and is more sensitive to symmetry breaking, as shown by neural-network predictions on helical spin chains.