Further advantages of data augmentation on convolutional neural networks

Alex Hernández-García; Peter König

arxiv: 1906.11052 · v1 · pith:R6GHZT2Inew · submitted 2019-06-26 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-25 15:49 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords data augmentation · convolutional neural networks · regularization · weight decay · dropout · ablation studies · hyperparameter tuning

The pith

Convolutional networks trained only with data augmentation adapt more easily to different architectures and data amounts than those using weight decay or dropout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts ablation studies on various convolutional neural network architectures trained with different amounts of data to compare data augmentation against explicit regularization methods. It establishes that data augmentation alone provides a more adaptable form of regularization. This means the same training setup works across changes in model or data size without needing to retune parameters. Readers would care because it questions the common practice of always combining data augmentation with weight decay and dropout.

Core claim

Through systematic ablations, the authors show that the regularization effect of data augmentation enables networks to maintain performance when switching architectures or varying training data volume, whereas weight decay and dropout necessitate specific hyperparameter adjustments for each new configuration.

What carries the argument

Ablation studies isolating the effects of data augmentation versus weight decay and dropout across multiple architectures and training data regimes.
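
The ablation design described here can be pictured as a full grid over three factors: architecture, amount of training data, and regularization setting. The following is a minimal sketch under stated assumptions, not the authors' code: the architecture names, data fractions, and on/off regularization flags are hypothetical placeholders and do not come from the paper.

```python
from itertools import product

# All values below are hypothetical placeholders, not the paper's settings.
ARCHITECTURES = ["arch_a", "arch_b", "arch_c"]
DATA_FRACTIONS = [1.0, 0.5, 0.1]
# Each setting toggles explicit regularization (weight decay + dropout)
# and data augmentation independently.
REGULARIZATION = [
    {"explicit_reg": True, "augmentation": False},
    {"explicit_reg": True, "augmentation": True},
    {"explicit_reg": False, "augmentation": False},
    {"explicit_reg": False, "augmentation": True},
]


def build_ablation_grid(architectures, data_fractions, regularization):
    """Return one run configuration per combination of the three factors."""
    return [
        {"architecture": arch, "data_fraction": frac, **reg}
        for arch, frac, reg in product(architectures, data_fractions, regularization)
    ]


grid = build_ablation_grid(ARCHITECTURES, DATA_FRACTIONS, REGULARIZATION)
print(len(grid))  # 3 architectures * 3 fractions * 4 settings = 36
```

Each entry in `grid` would correspond to one training run; comparing runs that differ only in the regularization flags is what isolates the effect of each technique.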

If this is right

Training can proceed with less dependence on hyperparameter searches for regularization techniques.
Changing network architecture or dataset size requires minimal additional adjustments when using data augmentation.
The benefits of data augmentation generalize more readily than those of explicit regularizers.
Combining data augmentation with weight decay and dropout may not always be necessary.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners might save time and compute by reducing reliance on tuning explicit regularizers.
This finding could prompt tests of data augmentation's adaptability in tasks beyond image classification.
Future studies might explore whether other implicit regularization methods share this flexibility.

Load-bearing premise

The ablation studies cover a sufficiently broad and representative set of network architectures and training data regimes.

What would settle it

Finding a new architecture or data amount where weight decay or dropout requires less hyperparameter adjustment than data augmentation to achieve comparable adaptability.

Figures

Figures reproduced from arXiv: 1906.11052 by Alex Hernández-García, Peter König.

**Figure 2.** Test performance of the models trained with weight decay and dropout (red) and the …

Original abstract

Data augmentation is a popular technique largely used to enhance the training of convolutional neural networks. Although many of its benefits are well known by deep learning researchers and practitioners, its implicit regularization effects, as compared to popular explicit regularization techniques, such as weight decay and dropout, remain largely unstudied. As a matter of fact, convolutional neural networks for image object classification are typically trained with both data augmentation and explicit regularization, assuming the benefits of all techniques are complementary. In this paper, we systematically analyze these techniques through ablation studies of different network architectures trained with different amounts of training data. Our results unveil a largely ignored advantage of data augmentation: networks trained with just data augmentation more easily adapt to different architectures and amount of training data, as opposed to weight decay and dropout, which require specific fine-tuning of their hyperparameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The ablations indicate data augmentation adapts more readily to architecture and data-size changes than weight decay or dropout because it avoids hyperparameter retuning, though the breadth of those tests determines how far the claim travels.

The main point is that networks trained with data augmentation alone handle switches in architecture or training-set size better than those relying on weight decay or dropout, since the latter need their coefficients retuned for each new setup. The paper reaches this through ablation studies that train multiple CNNs on varying data amounts and isolate the regularization methods. This comparison is the concrete addition: it treats the implicit regularization from augmentation as a distinct option rather than assuming all techniques are simply additive. The practical angle is clear: fewer hyperparameters to adjust when experimenting with new models or datasets. The experiments appear to cover standard image-classification setups, which fits the claim.

The soft spot is the scope. The abstract supplies no tables, error bars, or exact model/data combinations, so it is impossible to judge effect sizes or consistency. If the tested architectures stay within a narrow band (a few ResNet or VGG variants on CIFAR subsets) and the data regimes are limited, the observed robustness could be tied to those choices rather than a general property of augmentation. That is the load-bearing assumption the stress-test note flags, and it needs the full results to evaluate.

The work is empirical and avoids circular derivations. It engages the existing regularization literature without obvious contradictions. This is useful for practitioners who train CNNs for vision and want to reduce tuning overhead, but it is not a foundational result. A serious referee should see the full ablation details and check whether the tested range supports the generalization. I would send it to review.

Referee Report

2 major / 1 minor

Summary. The paper claims, based on ablation studies of CNN architectures trained with varying amounts of data, that data augmentation offers an implicit regularization benefit allowing networks to adapt more readily to changes in architecture and training set size without hyperparameter retuning, in contrast to weight decay and dropout which require architecture- and data-specific tuning.

Significance. If substantiated, the result would identify a practical advantage of data augmentation in reducing the hyperparameter search burden during CNN training, potentially affecting standard practices that combine all three regularizers.

major comments (2)

[Experiments / ablation studies] The central generalization that data augmentation is inherently more adaptable rests on the assumption that the ablation studies span a sufficiently representative set of architectures and data regimes; without explicit enumeration of the tested networks (beyond generic references) and data sizes, or justification that they cover edge cases such as very small training sets or non-standard CNN variants, the observed robustness could be specific to the chosen experimental conditions rather than a general property.
[Abstract and results presentation] No quantitative results, tables of accuracy deltas, error bars, or statistical tests are referenced in support of the adaptability claim, making it impossible to assess effect sizes or whether the advantage over weight decay/dropout holds after accounting for variance across runs.

minor comments (1)

Notation for the regularization techniques is introduced without a dedicated table or equation defining the exact implementations (e.g., the form of weight decay or dropout probability schedule).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline revisions to improve the manuscript's clarity and rigor.

Referee: [Experiments / ablation studies] The central generalization that data augmentation is inherently more adaptable rests on the assumption that the ablation studies span a sufficiently representative set of architectures and data regimes; without explicit enumeration of the tested networks (beyond generic references) and data sizes, or justification that they cover edge cases such as very small training sets or non-standard CNN variants, the observed robustness could be specific to the chosen experimental conditions rather than a general property.

Authors: We agree that explicit enumeration and scope justification would strengthen the claims. The original manuscript refers to 'different network architectures' and 'different amounts of training data' in generic terms in the abstract and experimental sections. In the revision we will add a dedicated subsection (or table) that explicitly lists all tested architectures (e.g., specific ResNet, VGG, and DenseNet variants) together with the exact training-set sizes and fractions used. We will also include a short discussion of the covered range, noting that the smallest sets examined were 10% of the original training data, and will clarify that the study is restricted to standard CNNs for image classification rather than claiming coverage of all possible non-standard variants. revision: yes
Referee: [Abstract and results presentation] No quantitative results, tables of accuracy deltas, error bars, or statistical tests are referenced in support of the adaptability claim, making it impossible to assess effect sizes or whether the advantage over weight decay/dropout holds after accounting for variance across runs.

Authors: The experimental results, including accuracy values for each regularization strategy across architectures and data regimes, are presented in the figures and tables of the results section. However, the manuscript does not report error bars from multiple runs or formal statistical tests, nor does it tabulate explicit accuracy deltas for the adaptability claim. We will revise the results section to include (i) a summary table of accuracy deltas between conditions, (ii) error bars computed from at least three independent runs with different random seeds, and (iii) a brief note on statistical significance where appropriate. These additions will allow readers to evaluate effect sizes and variability directly. revision: yes
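
The error bars and accuracy deltas promised in this response amount to simple summary statistics over repeated runs. A minimal sketch, assuming three seeds per condition and using made-up placeholder accuracies (these numbers are NOT results from the paper, and the function names are hypothetical):

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Mean and sample standard deviation of test accuracy across seeds."""
    return mean(accuracies), stdev(accuracies)


def accuracy_delta(acc_a, acc_b):
    """Difference in mean accuracy (condition a minus condition b)."""
    return mean(acc_a) - mean(acc_b)


# Placeholder values for illustration only; NOT taken from the paper.
aug_only = [0.90, 0.92, 0.91]
explicit_reg = [0.85, 0.89, 0.87]

m, s = summarize(aug_only)
delta = accuracy_delta(aug_only, explicit_reg)
print(f"augmentation only: {m:.3f} +/- {s:.3f}")
print(f"delta vs explicit regularization: {delta:.3f}")
```

With only three seeds the standard deviation is a rough indicator of spread, so any formal significance statement would need more runs or an explicit test.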

Circularity Check

0 steps flagged

No circularity: purely empirical ablation study

The paper reports ablation experiments on CNNs trained with data augmentation versus weight decay/dropout, varying architectures and training-set sizes. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The central claim rests on observed experimental outcomes rather than any reduction of a result to its own inputs by construction. This is the normal case for an empirical methods paper; the breadth of tested regimes is a question of experimental design, not circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical ablation results rather than new theoretical entities or derivations; the only free parameters are the regularization hyperparameters whose tuning is the object of comparison.

free parameters (2)

weight decay coefficient
Mentioned as requiring specific fine-tuning for different architectures and data amounts.
dropout rate
Mentioned as requiring specific fine-tuning for different architectures and data amounts.
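
For orientation, the two free parameters listed above usually enter training in the following textbook forms. This is the standard formulation, offered as an assumption about notation: the page does not state the exact variants the paper uses, and the symbols $\lambda$ (weight decay coefficient), $p$ (dropout rate), $\theta$ (weights), $h_i$ (activations), and $m_i$ (dropout mask) are introduced here for illustration only.

```latex
% Weight decay written as an L2 penalty with coefficient \lambda
\mathcal{L}_{\mathrm{wd}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2}\,\lVert \theta \rVert_2^2

% Inverted dropout with drop probability p applied to activations h_i
m_i \sim \mathrm{Bernoulli}(1 - p), \qquad \tilde{h}_i = \frac{m_i\, h_i}{1 - p}
```

Both $\lambda$ and $p$ are set by hand, which is why they count as tunable hyperparameters in the comparison with data augmentation.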

axioms (1)

domain assumption: Convolutional neural networks for image classification benefit from regularization to prevent overfitting
Implicit background assumption for all compared techniques.
