Confidence Calibration for Convolutional Neural Networks Using Structured Dropout

Zhilu Zhang, Adrian V. Dalca, Mert R. Sabuncu

arXiv: 1906.09551 · v1 · pith:TLT7ZLH4new · submitted 2019-06-23 · cs.LG · cs.CV · stat.ML

Pith reviewed 2026-05-25 17:55 UTC · model grok-4.3

Keywords: confidence calibration, structured dropout, convolutional neural networks, ensemble diversity, uncertainty quantification, Bayesian active learning, expected calibration error

The pith

Structured dropout improves confidence calibration in CNNs by reducing correlation among sampled models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper connects poor calibration of dropout-based uncertainty estimates in convolutional networks to high correlation between the different models obtained by sampling dropout masks. It argues that structured dropout, which applies dropout decisions in a spatially or layer-wise correlated manner, increases diversity among these models and thereby lowers calibration error. Experiments compare standard and structured dropout variants on SVHN, CIFAR-10, and CIFAR-100, measuring both diversity metrics and expected calibration error. The same technique is shown to benefit uncertainty-driven selection in a Bayesian active learning task. A sympathetic reader cares because well-calibrated probabilities are needed for reliable risk assessment in deployed classifiers.

Core claim

Through the lens of ensemble learning, calibration error is associated with the correlation between the models sampled with dropout. Motivated by this, structured dropout promotes model diversity and improves confidence calibration.
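
A standard ensemble-learning identity helps explain why correlation matters here. It is offered as intuition, not as the paper's own derivation, and it assumes T sampled models whose outputs f_t(x) for a fixed input share a common variance σ² and a common pairwise correlation ρ (symbols introduced for this sketch):

```latex
\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} f_t(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{T}\,\sigma^{2}
```

As T grows the second term vanishes, leaving a floor of ρσ² that shrinks only when the members become less correlated. The identity concerns the variance of the averaged prediction, so it motivates the link to calibration but does not by itself establish it.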

What carries the argument

Structured dropout that correlates dropout masks across spatial locations or network layers to reduce agreement among ensemble members.
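
The difference between unstructured and structured masks can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the feature map is a nested list indexed as [channel][row][column], and the block variant uses a simplified grid-aligned scheme, not the sampling procedure of any published method.

```python
import random

def element_mask(channels, height, width, rate, rng):
    """Standard dropout: every activation is kept or dropped independently."""
    return [[[0 if rng.random() < rate else 1 for _ in range(width)]
             for _ in range(height)] for _ in range(channels)]

def channel_mask(channels, height, width, rate, rng):
    """Channel-structured dropout: one decision per whole feature map."""
    mask = []
    for _ in range(channels):
        keep = 0 if rng.random() < rate else 1
        mask.append([[keep] * width for _ in range(height)])
    return mask

def block_mask(channels, height, width, rate, block, rng):
    """Block-structured dropout (assumed grid-aligned simplification):
    one decision is shared by each block x block spatial patch."""
    rows = -(-height // block)  # ceiling division
    cols = -(-width // block)
    mask = []
    for _ in range(channels):
        coarse = [[0 if rng.random() < rate else 1 for _ in range(cols)]
                  for _ in range(rows)]
        mask.append([[coarse[i // block][j // block] for j in range(width)]
                     for i in range(height)])
    return mask
```

Each sampled mask defines one member of the implicit ensemble. Sharing decisions across a channel or a patch removes whole features at once, which is the mechanism the page credits with making sampled models disagree more.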

If this is right

Lower expected calibration error on standard image classification benchmarks without altering the loss or architecture.
Higher measured diversity among dropout samples, visible in disagreement or mutual-information statistics.
Improved sample efficiency in Bayesian active learning when uncertainty estimates guide data selection.
Calibration gains that hold across multiple convolutional architectures and dropout rates.
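
Expected calibration error, the metric named in the first item above, can be computed as in the sketch below. It assumes equal-width confidence bins with a default of 10; the bin count, the bin-edge convention, and the function name are assumptions for illustration, since this page does not state the paper's settings.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 if the prediction was right, else 0.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map the confidence to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

For a dropout ensemble, the confidence would be taken from the class probabilities averaged over the sampled masks.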

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The correlation-calibration link may also explain why other diversity-inducing methods such as deep ensembles tend to produce better-calibrated outputs.
Structured dropout could be combined with post-hoc recalibration techniques to achieve further gains.
The same diversity mechanism might extend to other regularizers that implicitly create ensembles, such as stochastic depth.
Testing whether the benefit persists when models are trained to convergence on larger-scale datasets would test the robustness of the claimed mechanism.

Load-bearing premise

The assumption that calibration error is caused by (and can be reduced by changing) the correlation between dropout-sampled models rather than by network architecture, optimization, or dataset properties.

What would settle it

An experiment in which structured dropout measurably lowers model correlation yet expected calibration error stays the same or rises.
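
Such an experiment needs a number for how much the sampled models agree. One classical choice is the average pairwise disagreement rate, sketched below. This is an assumed stand-in: the page does not specify which diversity statistic the paper reports, and the function name is hypothetical.

```python
from itertools import combinations

def pairwise_disagreement(predictions):
    """Mean fraction of examples on which two sampled models differ.

    predictions[t][i] is the label that sampled model t assigns to
    example i. Higher values mean a more diverse ensemble.
    """
    pairs = list(combinations(range(len(predictions)), 2))
    total = 0.0
    for a, b in pairs:
        differing = sum(1 for ya, yb in zip(predictions[a], predictions[b])
                        if ya != yb)
        total += differing / len(predictions[a])
    return total / len(pairs)
```

The settling case would be a dropout variant whose disagreement is clearly higher than the baseline's while its ECE is unchanged or worse.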

Figures

Figures reproduced from arXiv: 1906.09551 by Adrian V. Dalca, Mert R. Sabuncu, Zhilu Zhang.

**Figure 2.** Test accuracy (left) and ECE (right) against number of models for ensemble prediction.

**Figure 3.** Left: Test accuracy against number of training samples for models with different methods of dropout and Variation Ratios as the acquisition function on CIFAR-10. Right: Relative improvements in test accuracy over that of the first iteration with different methods of dropout. MC dropout yields the least improvements of all the methods.

**Figure 4.** Plots of test time NLL (Left) and accuracy (Right) against dropout rate for models trained with different types of dropout on the SVHN, CIFAR-10 and CIFAR-100 datasets. Models trained with structured dropout can achieve better NLL performance, particularly for moderate values of the dropout rate. DropLayer is the least sensitive to the choice of dropout rate with respect to NLL. Interestingly, the NLL dras…

**Figure 5.** Left: Test accuracy against number of training samples for models with different methods of dropout and Max Entropy (Above) / BALD (Below) as the acquisition function on CIFAR-10. Right: Relative improvements in test accuracy over that of the first iteration with different methods of dropout. Similar to results obtained with Variation Ratios, MC dropout yields the least improvements of all the methods.
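
The three acquisition functions named in the Figure 3 and Figure 5 captions can all be derived from the class probabilities of several stochastic forward passes on one input. The sketch below uses their usual definitions from the Bayesian active learning literature. The function names and the data layout are assumptions for illustration, not the paper's code.

```python
import math
from collections import Counter

def entropy(p):
    """Shannon entropy in nats; zero-probability classes are skipped."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def acquisition_scores(probs):
    """probs[t][c]: probability of class c from the t-th stochastic pass."""
    n_passes = len(probs)
    n_classes = len(probs[0])
    # Ensemble prediction: average the class probabilities over passes.
    mean = [sum(p[c] for p in probs) / n_passes for c in range(n_classes)]
    # Each pass votes for its most probable class.
    votes = Counter(max(range(n_classes), key=p.__getitem__) for p in probs)
    modal_count = votes.most_common(1)[0][1]
    return {
        # Entropy of the averaged prediction.
        "max_entropy": entropy(mean),
        # Mutual information between prediction and model sample.
        "bald": entropy(mean) - sum(entropy(p) for p in probs) / n_passes,
        # Fraction of passes that disagree with the modal vote.
        "variation_ratio": 1.0 - modal_count / n_passes,
    }
```

In an active learning loop, the unlabeled inputs with the highest score are sent for labeling. All three scores are zero when every pass returns the same one-hot prediction, so an ensemble whose members never disagree gives the selection step little to work with.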

Original abstract

In classification applications, we often want probabilistic predictions to reflect confidence or uncertainty. Dropout, a commonly used training technique, has recently been linked to Bayesian inference, yielding an efficient way to quantify uncertainty in neural network models. However, as previously demonstrated, confidence estimates computed with a naive implementation of dropout can be poorly calibrated, particularly when using convolutional networks. In this paper, through the lens of ensemble learning, we associate calibration error with the correlation between the models sampled with dropout. Motivated by this, we explore the use of structured dropout to promote model diversity and improve confidence calibration. We use the SVHN, CIFAR-10 and CIFAR-100 datasets to empirically compare model diversity and confidence errors obtained using various dropout techniques. We also show the merit of structured dropout in a Bayesian active learning application.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Structured dropout cuts calibration error on CIFAR/SVHN by increasing diversity, but the experiments do not isolate correlation as the cause.

The core claim is that calibration error in dropout-based CNNs tracks the correlation among sampled models, and that structured dropout reduces this correlation enough to improve ECE on SVHN, CIFAR-10, and CIFAR-100 while also helping Bayesian active learning. The authors position this as a low-cost fix that follows from an ensemble view of dropout, and that framing is the clearest new piece relative to earlier dropout calibration work. The empirical comparisons across dropout variants are direct and use standard benchmarks, which makes the practical takeaway easy to check. The active-learning result adds a concrete downstream test that most calibration papers skip. Those parts hold up given the abstract and the reported setup.

The main weakness is that the causal story is not isolated. The paper shows lower correlation and lower ECE together, but does not run controls that hold individual-model accuracy, sharpness, or optimization trajectory fixed while varying only pairwise correlation. Without that, the ECE gains could come from changes in regularization strength or per-model behavior rather than from ensemble diversity. The abstract gives no error bars or statistical tests, so it is also unclear how reliable the differences are.

This paper is aimed at people who already train CNNs with dropout and need better uncertainty estimates for vision tasks. A reader working on reliable deep learning or active learning would get a usable idea and some numbers to replicate. It is not foundational, but the method is simple enough that the empirical results are worth referee time. I would send it out for review; the experiments are on solid ground even if the interpretation needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper claims that confidence calibration error for dropout-based CNNs arises from correlation among the models sampled by dropout. Motivated by an ensemble-learning perspective, it proposes structured dropout to increase model diversity and thereby reduce calibration error. This is evaluated empirically by comparing diversity metrics and expected calibration error (ECE) across dropout variants on SVHN, CIFAR-10 and CIFAR-100, with an additional demonstration in Bayesian active learning.

Significance. If the claimed causal link between reduced model correlation and improved calibration is substantiated, the work supplies a low-cost modification to a standard regularization technique that directly improves uncertainty quantification in deep classifiers, with immediate relevance to active learning and safety-critical applications.

major comments (1)

[Section 4 (Empirical Evaluation)] The central claim requires that calibration error is driven by (and improved by reducing) correlation among dropout-sampled models. The experiments compare diversity metrics and ECE across dropout variants on SVHN/CIFAR but do not include controls that hold individual-model accuracy, variance, or effective regularization fixed while varying only pairwise correlation. Without such isolation, the observed ECE reductions could arise from changes in per-model sharpness or optimization dynamics rather than ensemble diversity.

minor comments (2)

[Section 3] Notation for the structured dropout masks (e.g., block size, channel vs. spatial structure) is introduced only informally; an explicit definition or pseudocode would aid reproducibility.
[Figure 3] Figure captions for the diversity-vs-ECE scatter plots should state the number of Monte-Carlo samples used to estimate each point.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive criticism. We respond to the single major comment below.

Referee: [Section 4 (Empirical Evaluation)] The central claim requires that calibration error is driven by (and improved by reducing) correlation among dropout-sampled models. The experiments compare diversity metrics and ECE across dropout variants on SVHN/CIFAR but do not include controls that hold individual-model accuracy, variance, or effective regularization fixed while varying only pairwise correlation. Without such isolation, the observed ECE reductions could arise from changes in per-model sharpness or optimization dynamics rather than ensemble diversity.

Authors: We agree that the experiments do not isolate pairwise correlation while holding per-model accuracy, variance, or regularization strength fixed, and that this leaves open the possibility that ECE changes arise from other mechanisms. The variants compared (standard dropout, spatial dropout, channel dropout, etc.) were chosen because they alter mask structure in ways expected to affect correlation; the consistent alignment between measured diversity and ECE across SVHN, CIFAR-10, and CIFAR-100 supports the motivating hypothesis, but the design remains correlational rather than controlled. In the revised manuscript we will (i) explicitly acknowledge this limitation in Section 4, (ii) report per-model accuracy and sharpness statistics for each method so readers can assess confounding, and (iii) add a short discussion of the difficulty of constructing a perfect isolation experiment within the dropout framework.

Revision: yes
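
The per-model statistics promised in item (ii) could be gathered along the following lines. This is a hedged sketch: "sharpness" is taken here to mean the mean probability assigned to the predicted class, which is an assumption because the rebuttal does not define the term, and the function name and data layout are hypothetical.

```python
def per_model_stats(probs, labels):
    """Accuracy and mean confidence for each sampled model separately.

    probs[t][i][c]: probability of class c from sampled model t on
    example i. labels[i]: true class of example i.
    """
    stats = []
    n = len(labels)
    for model_probs in probs:
        hits = 0
        confidence_sum = 0.0
        for p, y in zip(model_probs, labels):
            pred = max(range(len(p)), key=p.__getitem__)
            hits += int(pred == y)
            confidence_sum += p[pred]  # assumed sharpness proxy
        stats.append({"accuracy": hits / n,
                      "mean_confidence": confidence_sum / n})
    return stats
```

If two dropout variants differ in ensemble ECE while these per-model numbers are similar, that would make the diversity explanation harder to attribute to per-model behavior alone.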

Circularity Check

0 steps flagged

No significant circularity; empirical associations and comparisons

The paper presents an empirical investigation that associates calibration error with dropout-induced model correlation and evaluates structured dropout variants on SVHN/CIFAR datasets. No equations, fitted parameters, or derivations are shown that reduce by construction to inputs, self-citations, or ansatzes. The central premise is framed as a motivation for experiments rather than a load-bearing theorem or self-referential definition. This is a standard self-contained empirical study with independent experimental content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that calibration error can be attributed to model correlation in dropout ensembles and that structured dropout will reduce that correlation. No free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Calibration error is associated with the correlation between the models sampled with dropout.
Invoked when the authors motivate structured dropout from the ensemble-learning perspective.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Algorithm and Hardware Co-Design for Efficient Complex-Valued Uncertainty Estimation
cs.AR 2026-04 unverdicted novelty 7.0

Proposes dropout-based BayesCVNNs with automated configuration search and FPGA accelerators that deliver 4.5x–13x speedups over GPUs while enabling uncertainty estimation for complex-valued neural networks.

VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning
cs.LG 2026-04 unverdicted novelty 5.0

VOLTA, consisting of a deep encoder with learnable prototypes plus cross-entropy and post-hoc temperature scaling, matches or exceeds ten UQ baselines in accuracy, achieves lower expected calibration error, and perfor...


[1] [1]

The description length of deep learning models

Léonard Blier and Yann Ollivier. The description length of deep learning models. In Advances in Neural Information Processing Systems, pages 2216–2226, 2018

work page 2018

[2] [2]

Weight uncertainty in neural network

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622, 2015

work page 2015

[3] [3]

Stochastic gradient hamiltonian monte carlo

Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In International conference on machine learning, pages 1683–1691, 2014

work page 2014

[4] [4]

Dropout as a bayesian approximation: Representing model uncertainty in deep learning

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. Ininternational conference on machine learning, pages 1050–1059, 2016

work page 2016

[5] [5]

A theoretically grounded application of dropout in recurrent neural networks

Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems , pages 1019–1027, 2016

work page 2016

[6] [6]

Concrete dropout

Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017

work page 2017

[7] [7]

Deep bayesian active learning with image data

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1183–1192. JMLR. org, 2017

work page 2017

[8] [8]

Shake-Shake regularization

Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[9] [9]

Bias-reduced uncertainty estimation for deep neural classiﬁers

Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classiﬁers. International Conference on Learning Representations, 2019

work page 2019

[10] [10]

Dropblock: A regularization method for convolutional networks

Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10727– 10737, 2018

work page 2018

[11] [11]

Meta-learning for stochastic gradient mcmc

Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta-learning for stochastic gradient mcmc. International Conference on Learning Representations, 2019

work page 2019

[12] [12]

Maxout networks

Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. International Conference on Machine Learning, 2013

work page 2013

[13] [13]

Practical variational inference for neural networks

Alex Graves. Practical variational inference for neural networks. In Advances in neural information processing systems, pages 2348–2356, 2011

work page 2011

[14] [14]

On calibration of modern neural networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org, 2017

work page 2017

[15] [15]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016

[16] [16]

Identity mappings in deep residual networks

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016

work page 2016

[17] [17]

Deep networks with stochastic depth

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European conference on computer vision, pages 646–661. Springer, 2016

work page 2016

[18] [18]

Averaging weights leads to wider optima and better generalization

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. Conference on Uncertainty in Artiﬁcial Intelligence, 2018

work page 2018

[19] [19]

What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017

Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017. 9

work page 2017

[20] [20]

Variational dropout and the local reparam- eterization trick

Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparam- eterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015

work page 2015

[21] [21]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009

work page 2009

[22] [22]

Neural network ensembles, cross validation, and active learning

Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in neural information processing systems, pages 231–238, 1995

work page 1995

[23] Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pages 3474–3482, 2015

[24] Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning, 51(2):181–207, 2003

[25] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017

[26] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. International Conference on Learning Representations, 2017

[27] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E Hopcroft. Convergent learning: Do different neural networks learn the same representations?

[28] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2218–2227. JMLR.org, 2017

[29] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015

[30] David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992

[31] Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476, 2019

[32] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015

[33] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012

[34] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011

[35] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable laplace approximation for neural networks. International Conference on Learning Representations, 2018

[36] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014

[37] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015

[38] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. International Conference on Learning Representations, 2019

[39] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in neural information processing systems, pages 550–558, 2016

[40] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International conference on machine learning, pages 1058–1066, 2013

[41] Liwei Wang, Lunjia Hu, Jiayuan Gu, Zhiqiang Hu, Yue Wu, Kun He, and John Hopcroft. Towards understanding learning representations: To what extent do different neural networks learn the same representation. In Advances in Neural Information Processing Systems, pages 9584–9593, 2018

[42] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011

[43] Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, José Miguel Hernández-Lobato, and Alexander L Gaunt. Deterministic variational inference for robust bayesian neural networks. International Conference on Learning Representations, 2018

[44] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017

[45] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012

Appendix A: Performance of Uncertainty Estimates Against Dropout Rate

[Figure: Test NLL versus dropout rate (0.05 to 0.40) on SVHN for dropout, dropBlock, dropChannel, and dropLayer ...]