Improved Regularization of Convolutional Neural Networks with Cutout

Terrance DeVries, Graham W. Taylor

arXiv: 1708.04552 · v2 · submitted 2017-08-15 · 💻 cs.CV

Pith reviewed 2026-05-13 20:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords: cutout, regularization, convolutional neural networks, data augmentation, CIFAR-10, CIFAR-100, SVHN, overfitting

The pith

Randomly masking square regions in training images improves convolutional neural network generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces cutout, a regularization method that randomly masks out square patches from input images during training. This forces convolutional networks to learn more distributed and robust features instead of relying on specific local patterns. The approach requires no architectural changes and combines readily with standard data augmentation. Experiments on CIFAR-10, CIFAR-100, and SVHN show it produces new state-of-the-art error rates when added to existing high-performing models. Readers would care because it offers a low-effort way to reduce overfitting in image classification tasks.

Core claim

The paper claims that randomly masking fixed-size square regions of the input image during training, called cutout, acts as an effective regularizer that improves the robustness and test accuracy of convolutional neural networks, achieving 2.56% error on CIFAR-10, 15.20% on CIFAR-100, and 1.30% on SVHN when applied to current state-of-the-art architectures.

What carries the argument

Cutout: the operation of selecting a random square region and setting its pixels to zero in each training image to encourage feature robustness.
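
A minimal NumPy sketch of the masking operation described above, assuming an image stored as a height × width × channels array. The function name `cutout`, the `size` and `rng` parameters, and the choice to sample the square's centre uniformly over the image and clip it at the borders are illustrative assumptions, not details confirmed on this page.

```python
import numpy as np


def cutout(image, size, rng=None):
    """Return a copy of `image` (H x W x C) with one random square zeroed.

    Assumption: the square's centre is drawn uniformly over the image and
    the square is clipped at the borders, so the visible mask can be
    smaller than size x size.
    """
    if rng is None:
        rng = np.random.default_rng()
    height, width = image.shape[:2]

    # Sample the centre of the square anywhere in the image.
    cy = int(rng.integers(0, height))
    cx = int(rng.integers(0, width))

    # Clip the square to the image bounds.
    y1, y2 = max(cy - size // 2, 0), min(cy + size // 2, height)
    x1, x2 = max(cx - size // 2, 0), min(cx + size // 2, width)

    masked = image.copy()
    masked[y1:y2, x1:x2] = 0  # same region zeroed in every channel
    return masked


# Usage: mask a 16 x 16 square in a dummy 32 x 32 RGB image.
rng = np.random.default_rng(0)
image = np.ones((32, 32, 3), dtype=np.float32)
masked = cutout(image, size=16, rng=rng)
```

In this sketch the mask is applied only when the function is called, so it would sit in the training-time augmentation pipeline and be skipped at evaluation. Whether to zero raw pixels or normalized values is a separate choice that the sketch leaves open.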

Load-bearing premise

A single fixed square size and random placement will produce consistent gains across architectures and datasets without requiring dataset-specific retuning or introducing harmful bias in the learned features.

What would settle it

Applying cutout with one fixed mask size to ImageNet or another large-scale dataset and observing no reduction in top-1 error relative to the unaugmented baseline would show the gains do not generalize.

Original abstract

Convolutional neural networks are capable of learning powerful representational spaces, which are necessary for tackling complex learning tasks. However, due to the model capacity required to capture such representations, they are often susceptible to overfitting and therefore require proper regularization in order to generalize well. In this paper, we show that the simple regularization technique of randomly masking out square regions of input during training, which we call cutout, can be used to improve the robustness and overall performance of convolutional neural networks. Not only is this method extremely easy to implement, but we also demonstrate that it can be used in conjunction with existing forms of data augmentation and other regularizers to further improve model performance. We evaluate this method by applying it to current state-of-the-art architectures on the CIFAR-10, CIFAR-100, and SVHN datasets, yielding new state-of-the-art results of 2.56%, 15.20%, and 1.30% test error respectively. Code is available at https://github.com/uoguelph-mlrg/Cutout

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cutout is a simple input-masking regularizer that delivers clear gains on CIFAR and SVHN when the square size is chosen per dataset, with code released for easy checking.

The main thing to know is that randomly masking out square patches from training images acts as an effective, low-cost regularizer for CNNs and pushes error rates lower on the standard small-image benchmarks when layered on top of existing augmentations. The authors show this on current architectures for CIFAR-10, CIFAR-100, and SVHN, reaching 2.56%, 15.20%, and 1.30% test error respectively, and they release the code so the numbers can be verified directly. That combination of simplicity and reproducible improvement is the real contribution here. The method is straightforward to add to a training loop and appears complementary to dropout and random crops without requiring architectural changes.

What the paper does well is keep the experiments focused and report consistent lifts across the three datasets using published models. The empirical case is direct: train with and without the masking and measure the difference on held-out test sets. No circular fitting or derived claims; just training runs.

The soft spot is the mask size. They settle on 16×16 for the CIFAR sets and 20×20 for SVHN after manual selection, but the write-up does not include a broad sweep or cross-architecture transfer tests. If the best size shifts with network depth or dataset statistics, part of the reported lift could trace to that tuning step rather than the core masking idea alone. Still, the gains hold in the setups they actually ran, so the central result is not undermined.

This paper is for practitioners who train image models on limited data and want a quick, cheap knob to turn. Readers who value reproducible tricks over theoretical novelty will find it worthwhile. It has enough grounded evidence to deserve a serious referee, even if a reviewer might ask for more sensitivity plots on the mask size. I would send it to peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces Cutout, a simple regularization technique for CNNs that randomly masks out square regions of the input image during training. The authors show that Cutout can be combined with existing data augmentations and regularizers, and report new state-of-the-art test errors of 2.56% on CIFAR-10, 15.20% on CIFAR-100, and 1.30% on SVHN when applied to modern architectures. Public code is released for reproducibility.

Significance. If the results hold, the work is significant because it supplies an extremely lightweight regularization method that delivers consistent gains on top of strong baselines and yields new SOTA numbers on three standard benchmarks. The code release is a clear strength, supporting verification and further use by the community.

major comments (1)

[Experiments] Experiments section: the reported SOTA results use manually selected fixed cutout sizes (16×16 on CIFAR-10/100, 20×20 on SVHN). No systematic sensitivity sweep or cross-architecture transfer experiment is presented to show that performance gains hold for a range of sizes without per-dataset retuning. This leaves open whether the improvements are attributable to the method itself or to implicit hyper-parameter selection.

minor comments (2)

[Abstract] The abstract states that Cutout is applied to 'current state-of-the-art architectures' but does not name the specific models used for each dataset; adding this detail would improve clarity.
[Method] In the method description, the precise implementation of the mask (e.g., whether it is applied identically across all channels and how boundary handling is performed) could be stated more explicitly to facilitate exact reproduction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive review and the recommendation for minor revision. We appreciate the recognition of Cutout as a lightweight regularization technique that yields consistent gains and new state-of-the-art results. We address the single major comment below.

point-by-point responses

Referee: [Experiments] Experiments section: the reported SOTA results use manually selected fixed cutout sizes (16×16 on CIFAR-10/100, 20×20 on SVHN). No systematic sensitivity sweep or cross-architecture transfer experiment is presented to show that performance gains hold for a range of sizes without per-dataset retuning. This leaves open whether the improvements are attributable to the method itself or to implicit hyper-parameter selection.

Authors: We thank the referee for highlighting this point. The cutout sizes were selected as roughly half the side length of the input images (CIFAR and SVHN images are 32×32 pixels), which provides a natural scale for occluding a meaningful portion of the image without removing all semantic content. While the original manuscript focused on demonstrating the method's effectiveness when combined with strong modern architectures and existing augmentations, we agree that an explicit sensitivity analysis would better substantiate that the gains arise from the regularization mechanism itself rather than from dataset-specific tuning. In the revised version we will add a new figure and accompanying text in the Experiments section that reports test error on CIFAR-10 (using Wide ResNet-28-10) for cutout sizes ranging from 0 to 24 pixels in steps of 4. This will show that performance improvements are obtained across a broad interval around the chosen size of 16, thereby addressing the concern about implicit hyper-parameter selection.

revision: yes
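
The "roughly half the side length" reasoning in the rebuttal above can be checked with simple arithmetic. The sketch below, assuming a square that lies fully inside a 32×32 image, prints the fraction of the side and of the area covered by each size in the proposed sweep. The helper name `mask_fractions` is hypothetical.

```python
IMAGE_SIDE = 32  # CIFAR and SVHN image side length, as stated above


def mask_fractions(size, side=IMAGE_SIDE):
    """Return (side fraction, area fraction) for an unclipped square mask."""
    return size / side, (size * size) / (side * side)


# Sizes in the proposed sweep: 0 to 24 pixels in steps of 4.
for size in range(0, 25, 4):
    side_frac, area_frac = mask_fractions(size)
    print(f"size={size:2d}  side={side_frac:.3f}  area={area_frac:.3f}")
```

Under this reading, a 16-pixel square spans half the side but only a quarter of the area, while a 20-pixel square spans 62.5% of the side and about 39% of the area. If an implementation lets the square extend past the image border, the masked area would be smaller on average than these figures.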

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on independent training runs with no derivations or self-referential reductions.

full rationale

This is an empirical paper proposing the Cutout regularization method (random square masking of inputs) and reporting its effect when combined with existing augmentations. All performance numbers (2.56% on CIFAR-10, 15.20% on CIFAR-100, 1.30% on SVHN) are obtained from explicit model training and evaluation on held-out test sets; no equations, fitted parameters, or predictive derivations appear in the work. Consequently there are no self-definitional steps, fitted-input-called-prediction steps, or load-bearing self-citations that collapse any claim to its own inputs by construction. The method's description and experimental protocol are self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on the empirical effectiveness of random square masking as a regularizer. The only notable free parameter is the cutout square size, which is chosen per dataset. No new entities are postulated and no circular derivations appear.

free parameters (1)

cutout square size
The side length of the masked square is a hyperparameter tuned separately for each dataset and architecture.

axioms (1)

domain assumption: Standard CNN training assumptions (SGD, cross-entropy loss, data augmentation pipeline)
The method is applied on top of existing training procedures without altering their core assumptions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

Paper passage: "cutout can be used in conjunction with existing forms of data augmentation and other regularizers"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Navigating Potholes with Geometry-Aware Sharpness Minimization
cs.LG 2026-05 unverdicted novelty 7.0

LLQR+SAM pairs a slow learned geometry preconditioner with fast SAM perturbations to amplify escape from locally sharp 'potholes' while stabilizing flat basins, producing consistent gains over SAM and LLQR alone.
Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes
cs.LG 2026-05 unverdicted novelty 7.0

BICL uses biased non-uniform transition matrices to generate constrained complementary labels, enabling effective learning and over sevenfold accuracy gains on many-class image datasets.
Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation
stat.ML 2026-05 conditional novelty 7.0

The test error of random-feature ridge regression with arbitrary data augmentation admits a closed-form asymptotic characterization in the proportional regime that depends only on population covariances and augmentati...
SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data
cs.LG 2026-05 unverdicted novelty 7.0

SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships ...
Layerwise LQR for Geometry-Aware Optimization of Deep Networks
cs.LG 2026-05 unverdicted novelty 7.0

Steepest descent under divergence-induced quadratic models equals an LQR problem, enabling learning of diagonal or Kronecker-factored inverse preconditioners via a global layerwise objective for scalable geometry-awar...
QB-LIF: Learnable-Scale Quantized Burst Neurons for Efficient SNNs
cs.CV 2026-04 unverdicted novelty 7.0

QB-LIF uses a trainable quantization scale for burst neurons in SNNs to raise accuracy at ultra-low latency on vision and event datasets while preserving neuromorphic hardware compatibility.
Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms
cs.LG 2026-04 unverdicted novelty 7.0

Unlearnable examples fail under pretraining-finetuning due to semantic filtering by frozen layers, but Shallow Semantic Camouflage restores effectiveness by confining perturbations to semantically valid subspaces.
Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models
cs.CV 2026-04 unverdicted novelty 7.0

SAM-family models split into occluder-aware types that avoid predicting into occluded regions and occluder-agnostic types that confidently segment hidden areas, shown via a new benchmark on polyp datasets.
Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP
cs.LG 2024-12 conditional novelty 7.0

PAR fine-tunes CLIP to remove backdoors from structured triggers while preserving standard performance, and works even with only synthetic image-text pairs.
A Simple Framework for Contrastive Learning of Visual Representations
cs.LG 2020-02 accept novelty 7.0

SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
Point Cloud Sequence Encoding for Material-conditioned Graph Network Simulators
cs.LG 2026-05 unverdicted novelty 6.0

PEACH uses a novel spatio-temporal point cloud sequence encoder plus auxiliary supervision to enable zero-shot adaptation of graph network simulators to unseen physical properties, outperforming mesh-based baselines i...
Anatomy of a failure: When, how, and why deep vision fails in scientific domains
cs.CV 2026-05 unverdicted novelty 6.0

Deep learning on information-rich scientific images collapses to one-dimensional predictions due to a mismatch between data priors and the model's simplicity bias, even after robustification techniques.
IonMorphNet: Generalizable Learning of Ion Image Morphologies for Peak Picking in Mass Spectrometry Imaging
cs.CV 2026-04 unverdicted novelty 6.0

IonMorphNet is a ConvNeXt-based classifier trained on six spatial pattern classes from 53 MSI datasets that performs generalizable peak picking and improves mSCF1 by 7% over prior methods while also aiding tumor class...
Enhancing Tabular Anomaly Detection via Pseudo-Label-Guided Generation
cs.AI 2026-04 unverdicted novelty 6.0

PLAG boosts tabular anomaly detection by using pseudo-label-guided synthetic anomaly generation with a two-stage filter, achieving SOTA results and lifting F1 scores by 0.08-0.21 when added to existing detectors.
Soft Label Pruning and Quantization for Large-Scale Dataset Distillation
cs.CV 2026-04 unverdicted novelty 6.0

LPQLD reduces soft label storage in dataset distillation by 78-500x on ImageNet datasets via pruning with dynamic reuse and quantization with student-teacher alignment, while improving accuracy.
FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction
cs.CV 2026-04 unverdicted novelty 6.0

FireSenseNet dual-branch CNN with CAFIM cross-attention outperforms larger models on next-day wildfire spread prediction, reaching F1 of 0.4176 on the Google benchmark.
OASIC: Occlusion-Agnostic and Severity-Informed Classification
cs.CV 2026-04 conditional novelty 6.0

OASIC uses anomaly-based masking and severity estimation to select occlusion-matched models, improving AUC on occluded images by up to 23.7 points.
Semantic-aware Random Convolution and Source Matching for Domain Generalization in Medical Image Segmentation
cs.CV 2025-12 unverdicted novelty 6.0

Semantic-aware random convolution and intensity-based source matching enable effective single-source domain generalization for medical image segmentation, outperforming prior methods and sometimes matching in-domain p...
Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition
cs.CV 2025-04 unverdicted novelty 6.0

Masked Language Prompting masks selected words in reference captions and leverages LLMs to produce diverse, semantically coherent completions for style-consistent generative image augmentation without fine-tuning.
Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection
cs.CV 2024-11 unverdicted novelty 6.0

Orthogonal subspace decomposition via SVD on vision foundation model features preserves high-rank pre-trained knowledge by freezing principal components and adapting residuals, reducing overfitting for better generali...
Decouple then Converge: Handling Unknown Unlabeled Distributions in Long-Tailed Semi-Supervised Learning
cs.LG 2024-06 unverdicted novelty 6.0

DeCon decouples LTSSL into head-class and tail-class branches that interact and converge, delivering SOTA accuracy on mismatched-distribution benchmarks and outperforming prior methods even on matched distributions.
Sharpness-Aware Minimization for Efficiently Improving Generalization
cs.LG 2020-10 conditional novelty 6.0

SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks
cs.CL 2019-07 unverdicted novelty 6.0

DropAttention regularizes attention weights in fully-connected self-attention networks to reduce overfitting and improve performance.
XferNAS: Transfer Neural Architecture Search
cs.LG 2019-07 unverdicted novelty 6.0

XferNAS transfers knowledge across neural architecture searches to reduce search time by a factor of 33 on CIFAR-10/100 while achieving new records of 1.99% and 14.06% error.
Learning Data Augmentation Strategies for Object Detection
cs.CV 2019-06 unverdicted novelty 6.0

Learned data augmentation policies optimized for object detection improve COCO mAP by more than 2.3 and transfer to other datasets and models.
Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification
cs.CV 2026-05 unverdicted novelty 5.0

DPL-ReID adds dual prompt learning, real-world occlusion augmentation, and weighted gated fusion to CLIP for state-of-the-art occluded person re-identification on benchmark datasets.
Margin-Adaptive Confidence Ranking for Reliable LLM Judgement
cs.LG 2026-05 unverdicted novelty 5.0

Introduces a margin-adaptive confidence ranking method that learns an estimator from simulated diversity and derives margin-dependent generalization bounds for use in fixed-sequence testing of LLM-human agreement.
ZScribbleSeg: A comprehensive segmentation framework with modeling of efficient annotation and maximization of scribble supervision
cs.CV 2026-05 unverdicted novelty 5.0

ZScribbleSeg maximizes scribble supervision with efficient annotation forms, spatial regularization, and EM-estimated class ratios to deliver competitive performance on six medical segmentation tasks without full labels.
Accuracy Improvement of Semi-Supervised Segmentation Using Supervised ClassMix and Sup-Unsup Feature Discriminator
cs.CV 2026-04 unverdicted novelty 5.0

Supervised ClassMix and a Sup-Unsup Feature Discriminator yield an average 2.07% mIoU gain over standard semi-supervised methods on Chase and COVID-19 datasets.
Bi-Level Optimization for Single Domain Generalization
cs.LG 2026-04 unverdicted novelty 5.0

BiSDG applies bi-level optimization with surrogate domains and a domain prompt encoder to achieve state-of-the-art results in single domain generalization.
WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval
cs.CV 2026-04 unverdicted novelty 5.0

WRF4CIR uses weight-regularized fine-tuning with adversarial perturbations to mitigate overfitting in composed image retrieval and narrows the generalization gap on benchmarks.
Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It
eess.IV 2026-04 unverdicted novelty 5.0

MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.
YOLOv4: Optimal Speed and Accuracy of Object Detection
cs.CV 2020-04 unverdicted novelty 5.0

YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
How Data Augmentation Shapes Neural Representations
cs.LG 2026-05 unverdicted novelty 4.0

Data augmentation produces well-behaved trajectories in shape-invariant representation space, with augmentation type steering distinct directions and geometry predicting ensembling gains.
AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation
cs.CV 2026-05 unverdicted novelty 4.0

AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
FGML-DG: Feynman-Inspired Cognitive Science Paradigm for Cross-Domain Medical Image Segmentation
cs.CV 2026-04 unverdicted novelty 4.0

FGML-DG applies Feynman-inspired principles of concept simplification, memory recall, and error-focused retraining within a meta-learning setup to enhance domain generalization for medical image segmentation.
Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems
cs.LG 2019-07 unverdicted novelty 4.0

Experiments show that shifted-ReLU layers can replace batch-normalization in single-bit-weight wide residual networks on CIFAR-10/100 and ImageNet without consistent accuracy penalty.
Further advantages of data augmentation on convolutional neural networks
cs.CV 2019-06 unverdicted novelty 4.0

Data augmentation enables CNNs to adapt to varying architectures and data amounts without hyperparameter fine-tuning, unlike weight decay and dropout.
SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions
cs.LG 2026-05 accept novelty 3.0

NTGA is the first clean-label generalization attack under black-box settings but is vulnerable to adversarial training and image transformations, with newer attacks outperforming it.
Data-Centric Foundation Models in Computational Healthcare: A Survey
cs.LG 2024-01 unverdicted novelty 3.0

The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
Genetic Network Architecture Search
cs.NE 2019-07 unverdicted novelty 3.0

Genetic algorithm searches convolution cell architectures with weight sharing via SGD, reporting 96% accuracy on CIFAR10 and 80.1% on CIFAR100.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 41 Pith papers

[1] Y. Bengio, A. Bergeron, N. Boulanger-Lewandowski, T. Breuel, Y. Chherawala, et al. Deep learners benefit more from out-of-distribution examples. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 164–172, 2011.
[2] A. Canziani, A. Paszke, and E. Culurciello. An analysis of deep neural network models for practical applications. In IEEE International Symposium on Circuits & Systems, 2016.
[3] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[4] X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[6] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[7] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[10] J. Lemley, S. Bazrafkan, and P. Corcoran. Smart augmentation-learning an optimal data augmentation strategy. IEEE Access, 2017.
[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[12] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
[13] S. Park and N. Kwak. Analysis on the dropout effect in convolutional neural networks. In Asian Conference on Computer Vision, pages 189–204. Springer, 2016.
[14] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[16] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In CVPR, pages 648–656, 2015.
[17] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In CVPR, pages 1653–1660, 2014.
[18] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[19] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
[20] H. Wu and X. Gu. Towards dropout training for convolutional neural networks. Neural Networks, 71:1–10, 2015.
[21] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 7(8), 2015.
[22] S. Zagoruyko and N. Komodakis. Wide residual networks. British Machine Vision Conference (BMVC), 2016.
