Confidence Calibration for Convolutional Neural Networks Using Structured Dropout
Pith reviewed 2026-05-25 17:55 UTC · model grok-4.3
The pith
Structured dropout improves confidence calibration in CNNs by reducing correlation among sampled models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through the lens of ensemble learning, calibration error is associated with the correlation between the models sampled with dropout. Motivated by this, structured dropout promotes model diversity and improves confidence calibration.
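A standard ensemble identity helps show why correlation among members matters. The symbols here are illustrative assumptions, not the paper's notation: $f_i$ is the output of the $i$-th sampled model, $M$ is the number of samples, $\sigma^2$ is a common per-model variance, and $\rho$ is a common pairwise correlation. Under those simplifying assumptions, the variance of the averaged prediction is:

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{i=1}^{M} f_i\right)
  = \frac{1}{M^2}\left[ M\sigma^2 + M(M-1)\,\rho\,\sigma^2 \right]
  = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2
```

Adding samples shrinks only the second term, so the first term $\rho\,\sigma^2$ sets a floor that can be lowered only by reducing correlation. This is a textbook illustration of the ensemble view, not the paper's own derivation linking correlation to calibration error.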
What carries the argument
Structured dropout that correlates dropout masks across spatial locations or network layers to reduce agreement among ensemble members.
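The difference between unstructured and structured masks can be sketched with NumPy. The function names, the `(channels, height, width)` layout, and the `seed_rate` parameter are assumptions made for illustration; the block variant is a simplified patch-dropping scheme and should not be read as the paper's exact formulation. Rescaling of the surviving activations is omitted.

```python
import numpy as np

def element_mask(shape, drop_rate, rng):
    """Standard dropout: every activation is dropped independently."""
    return (rng.random(shape) >= drop_rate).astype(np.float32)

def channel_mask(shape, drop_rate, rng):
    """Channel-wise dropout: one draw per channel, shared over height and width."""
    c, h, w = shape
    keep = (rng.random((c, 1, 1)) >= drop_rate).astype(np.float32)
    return np.broadcast_to(keep, shape).copy()

def block_mask(shape, seed_rate, block_size, rng):
    """Block-wise dropout (simplified): each seed zeroes a square patch."""
    c, h, w = shape
    mask = np.ones(shape, dtype=np.float32)
    # Seeds are top-left corners, restricted so every patch fits in the map.
    seeds = rng.random((c, h - block_size + 1, w - block_size + 1)) < seed_rate
    for ch, i, j in zip(*np.nonzero(seeds)):
        mask[ch, i:i + block_size, j:j + block_size] = 0.0
    return mask

rng = np.random.default_rng(0)
shape = (8, 16, 16)  # hypothetical feature map: 8 channels, 16x16 spatial
m_element = element_mask(shape, 0.25, rng)
m_channel = channel_mask(shape, 0.25, rng)
m_block = block_mask(shape, 0.02, 4, rng)
```

In the channel and block variants, neighbouring activations are dropped together, so each sampled sub-network loses whole features or regions rather than scattered units.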
If this is right
- Lower expected calibration error on standard image classification benchmarks without altering the loss or architecture.
- Higher measured diversity among dropout samples, visible in disagreement or mutual-information statistics.
- Improved sample efficiency in Bayesian active learning when uncertainty estimates guide data selection.
- Calibration gains that hold across multiple convolutional architectures and dropout rates.
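For readers unfamiliar with the metric in the first bullet, expected calibration error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence in each bin. The sketch below follows that common equal-width-bin definition; the function name and the default of 15 bins are assumptions, and the paper's exact binning is not stated on this page.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins.

    probs:  (N, K) predicted class probabilities
    labels: (N,)   integer class labels
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)                      # top-class confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap            # weight by bin population
    return float(ece)

# Toy example: two confident correct predictions, two less confident ones
# of which one is wrong. The result is approximately 0.1.
demo_probs = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]]
demo_labels = [0, 0, 0, 1]
ece_demo = expected_calibration_error(demo_probs, demo_labels, n_bins=10)
```

In the toy example, the 0.9-confidence bin is 100% accurate and the 0.6-confidence bin is 50% accurate, so each contributes a gap of 0.1 with weight 0.5.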
Where Pith is reading between the lines
- The correlation-calibration link may also explain why other diversity-inducing methods such as deep ensembles tend to produce better-calibrated outputs.
- Structured dropout could be combined with post-hoc recalibration techniques to achieve further gains.
- The same diversity mechanism might extend to other regularizers that implicitly create ensembles, such as stochastic depth.
- Checking whether the benefit persists when models are trained to convergence on larger-scale datasets would test the robustness of the claimed mechanism.
Load-bearing premise
The assumption that calibration error is caused by (and can be reduced by changing) the correlation between dropout-sampled models rather than by network architecture, optimization, or dataset properties.
What would settle it
An experiment in which structured dropout measurably lowers model correlation yet expected calibration error stays the same or rises.
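Such an experiment needs a concrete measure of agreement among sampled models. One simple candidate, sketched below, is the mean pairwise disagreement rate between the class predictions of dropout samples. The function name and the `(T, N, K)` array layout are assumptions, and this is one common diversity statistic, not necessarily the one the paper reports.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_disagreement(sample_probs):
    """Average fraction of inputs on which two sampled models predict different classes.

    sample_probs: (T, N, K) softmax outputs of T dropout samples on N inputs.
    """
    preds = np.asarray(sample_probs).argmax(axis=2)   # (T, N) predicted classes
    if preds.shape[0] < 2:
        raise ValueError("need at least two sampled models")
    rates = [np.mean(preds[a] != preds[b])
             for a, b in combinations(range(preds.shape[0]), 2)]
    return float(np.mean(rates))

# Two samples that agree on the first input and disagree on the second.
two_samples = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.9, 0.1], [0.7, 0.3]],
])
disagreement_demo = mean_pairwise_disagreement(two_samples)
```

A higher value means the sampled models agree less often, which is the direction structured dropout is claimed to push. Reporting it alongside expected calibration error would show whether the two move together.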
Original abstract
In classification applications, we often want probabilistic predictions to reflect confidence or uncertainty. Dropout, a commonly used training technique, has recently been linked to Bayesian inference, yielding an efficient way to quantify uncertainty in neural network models. However, as previously demonstrated, confidence estimates computed with a naive implementation of dropout can be poorly calibrated, particularly when using convolutional networks. In this paper, through the lens of ensemble learning, we associate calibration error with the correlation between the models sampled with dropout. Motivated by this, we explore the use of structured dropout to promote model diversity and improve confidence calibration. We use the SVHN, CIFAR-10 and CIFAR-100 datasets to empirically compare model diversity and confidence errors obtained using various dropout techniques. We also show the merit of structured dropout in a Bayesian active learning application.
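The dropout-as-ensemble view in the abstract can be made concrete with a short sketch: average the softmax outputs of several stochastic forward passes, then score each input by the mutual information between its prediction and the sampled model. Mutual information is a commonly used acquisition score in Bayesian active learning. The function names and the `(T, N, K)` layout are assumptions, and this page does not confirm which acquisition function the paper uses.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def mc_predictive_and_mutual_info(sample_probs):
    """sample_probs: (T, N, K) softmax outputs from T stochastic forward passes."""
    sample_probs = np.asarray(sample_probs, dtype=float)
    mean_probs = sample_probs.mean(axis=0)                  # (N, K) ensemble prediction
    predictive_entropy = entropy(mean_probs)                # total uncertainty
    expected_entropy = entropy(sample_probs).mean(axis=0)   # average per-sample uncertainty
    return mean_probs, predictive_entropy - expected_entropy

# Two confident samples that contradict each other on a single input.
conflicting = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
mean_demo, mi_demo = mc_predictive_and_mutual_info(conflicting)
```

When every sample gives the same distribution, the score is close to zero. It grows when individually confident samples contradict one another, so an acquisition rule based on it depends directly on how diverse the sampled models are.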
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that confidence calibration error for dropout-based CNNs arises from correlation among the models sampled by dropout. Motivated by an ensemble-learning perspective, it proposes structured dropout to increase model diversity and thereby reduce calibration error. This is evaluated empirically by comparing diversity metrics and expected calibration error (ECE) across dropout variants on SVHN, CIFAR-10 and CIFAR-100, with an additional demonstration in Bayesian active learning.
Significance. If the claimed causal link between reduced model correlation and improved calibration is substantiated, the work supplies a low-cost modification to a standard regularization technique that directly improves uncertainty quantification in deep classifiers, with immediate relevance to active learning and safety-critical applications.
major comments (1)
- [Section 4 (Empirical Evaluation)] The central claim requires that calibration error is driven by (and improved by reducing) correlation among dropout-sampled models. The experiments compare diversity metrics and ECE across dropout variants on SVHN/CIFAR but do not include controls that hold individual-model accuracy, variance, or effective regularization fixed while varying only pairwise correlation. Without such isolation, the observed ECE reductions could arise from changes in per-model sharpness or optimization dynamics rather than ensemble diversity.
minor comments (2)
- [Section 3] Notation for the structured dropout masks (e.g., block size, channel vs. spatial structure) is introduced only informally; an explicit definition or pseudocode would aid reproducibility.
- [Figure 3] Figure captions for the diversity-vs-ECE scatter plots should state the number of Monte-Carlo samples used to estimate each point.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive criticism. We respond to the single major comment below.
- Referee: [Section 4 (Empirical Evaluation)] The central claim requires that calibration error is driven by (and improved by reducing) correlation among dropout-sampled models. The experiments compare diversity metrics and ECE across dropout variants on SVHN/CIFAR but do not include controls that hold individual-model accuracy, variance, or effective regularization fixed while varying only pairwise correlation. Without such isolation, the observed ECE reductions could arise from changes in per-model sharpness or optimization dynamics rather than ensemble diversity.
  Authors: We agree that the experiments do not isolate pairwise correlation while holding per-model accuracy, variance, or regularization strength fixed, and that this leaves open the possibility that ECE changes arise from other mechanisms. The variants compared (standard dropout, spatial dropout, channel dropout, etc.) were chosen because they alter mask structure in ways expected to affect correlation; the consistent alignment between measured diversity and ECE across SVHN, CIFAR-10, and CIFAR-100 supports the motivating hypothesis, but the design remains correlational rather than controlled. In the revised manuscript we will (i) explicitly acknowledge this limitation in Section 4, (ii) report per-model accuracy and sharpness statistics for each method so readers can assess confounding, and (iii) add a short discussion of the difficulty of constructing a perfect isolation experiment within the dropout framework.
  Revision: yes
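The per-model statistics promised in item (ii) could be collected along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, and "sharpness" is approximated here by mean top-class confidence, which may differ from the definition the authors adopt.

```python
import numpy as np

def per_model_stats(sample_probs, labels):
    """Accuracy and mean top-class confidence of each sampled model.

    sample_probs: (T, N, K) softmax outputs of T dropout samples
    labels:       (N,)      integer class labels
    """
    sample_probs = np.asarray(sample_probs, dtype=float)
    labels = np.asarray(labels)
    accuracy = (sample_probs.argmax(axis=2) == labels).mean(axis=1)   # (T,)
    confidence = sample_probs.max(axis=2).mean(axis=1)                # (T,)
    return accuracy, confidence

stats_probs = [
    [[0.9, 0.1], [0.2, 0.8]],   # sample 0: both inputs correct
    [[0.6, 0.4], [0.7, 0.3]],   # sample 1: second input wrong
]
acc_demo, conf_demo = per_model_stats(stats_probs, [0, 1])
```

If two dropout variants differ in these per-model numbers as well as in diversity, the calibration change cannot be attributed to diversity alone, which is the confound the referee raises.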
Circularity Check
No significant circularity; empirical associations and comparisons
The paper presents an empirical investigation that associates calibration error with dropout-induced model correlation and evaluates structured dropout variants on SVHN/CIFAR datasets. No equations, fitted parameters, or derivations are shown that reduce by construction to inputs, self-citations, or ansatzes. The central premise is framed as a motivation for experiments rather than a load-bearing theorem or self-referential definition. This is a standard self-contained empirical study with independent experimental content.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: calibration error is associated with the correlation between the models sampled with dropout.
Forward citations
Cited by 2 Pith papers
- Algorithm and Hardware Co-Design for Efficient Complex-Valued Uncertainty Estimation
  Proposes dropout-based BayesCVNNs with automated configuration search and FPGA accelerators that deliver 4.5x–13x speedups over GPUs while enabling uncertainty estimation for complex-valued neural networks.
- VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning
  VOLTA, consisting of a deep encoder with learnable prototypes plus cross-entropy and post-hoc temperature scaling, matches or exceeds ten UQ baselines in accuracy, achieves lower expected calibration error, and perfor...