
arxiv: 1906.09551 · v1 · pith:TLT7ZLH4new · submitted 2019-06-23 · 💻 cs.LG · cs.CV · stat.ML

Confidence Calibration for Convolutional Neural Networks Using Structured Dropout

Pith reviewed 2026-05-25 17:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML
keywords: confidence calibration · structured dropout · convolutional neural networks · ensemble diversity · uncertainty quantification · Bayesian active learning · expected calibration error

The pith

Structured dropout improves confidence calibration in CNNs by reducing correlation among sampled models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper connects poor calibration of dropout-based uncertainty estimates in convolutional networks to high correlation between the different models obtained by sampling dropout masks. It argues that structured dropout, which applies dropout decisions in a spatially or layer-wise correlated manner, increases diversity among these models and thereby lowers calibration error. Experiments compare standard and structured dropout variants on SVHN, CIFAR-10, and CIFAR-100, measuring both diversity metrics and expected calibration error. The same technique is shown to benefit uncertainty-driven selection in a Bayesian active learning task. A sympathetic reader cares because well-calibrated probabilities are needed for reliable risk assessment in deployed classifiers.

Core claim

Through the lens of ensemble learning, calibration error is associated with the correlation between the models sampled with dropout. Motivated by this, structured dropout promotes model diversity and improves confidence calibration.

What carries the argument

Structured dropout that correlates dropout masks across spatial locations or network layers to reduce agreement among ensemble members.
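To make "structured" concrete, the sketch below contrasts an element-wise mask with two spatially structured ones. It is a minimal NumPy illustration, not the authors' implementation: the function names, the NCHW tensor layout, and the coarse-grid construction of the block mask are all assumptions made for this example.

```python
import numpy as np

def element_mask(shape, rate, rng):
    """Standard dropout: an independent keep decision per activation."""
    return (rng.random(shape) >= rate).astype(np.float32)

def channel_mask(shape, rate, rng):
    """Channel-wise dropout: one keep decision per (sample, channel),
    shared across every spatial position of that feature map."""
    n, c, h, w = shape
    keep = (rng.random((n, c, 1, 1)) >= rate).astype(np.float32)
    return np.broadcast_to(keep, shape).copy()

def block_mask(shape, rate, block, rng):
    """Block-wise dropout (simplified): draw decisions on a coarse grid,
    then expand each decision to a block x block spatial patch."""
    n, c, h, w = shape
    gh, gw = -(-h // block), -(-w // block)  # ceiling division
    coarse = (rng.random((n, c, gh, gw)) >= rate).astype(np.float32)
    full = np.repeat(np.repeat(coarse, block, axis=2), block, axis=3)
    return full[:, :, :h, :w]  # crop if h or w is not a multiple of block

def apply_dropout(x, mask, rate):
    """Inverted-dropout scaling so the expected activation is unchanged."""
    return x * mask / (1.0 - rate)
```

All three masks drop each activation with the same marginal probability `rate`; they differ only in how strongly the drop decisions of neighbouring activations are tied together.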

If this is right

  • Lower expected calibration error on standard image classification benchmarks without altering the loss or architecture.
  • Higher measured diversity among dropout samples, visible in disagreement or mutual-information statistics.
  • Improved sample efficiency in Bayesian active learning when uncertainty estimates guide data selection.
  • Calibration gains that hold across multiple convolutional architectures and dropout rates.
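For readers unfamiliar with the metric in the first bullet, here is a minimal sketch of expected calibration error in its common binned form: the weighted average, over confidence bins, of the gap between accuracy and mean confidence. The function name, the equal-width binning, and the default of 15 bins are assumptions for this example; the page does not state the paper's exact settings.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE for class probabilities of shape (N, C) and integer labels (N,)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)              # confidence of the predicted class
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap    # weight by the fraction of samples in the bin
    return ece
```

With Monte Carlo dropout, `probs` would typically be the average of the softmax outputs over several sampled masks.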

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The correlation-calibration link may also explain why other diversity-inducing methods such as deep ensembles tend to produce better-calibrated outputs.
  • Structured dropout could be combined with post-hoc recalibration techniques to achieve further gains.
  • The same diversity mechanism might extend to other regularizers that implicitly create ensembles, such as stochastic depth.
  • Checking whether the benefit persists when models are trained to convergence on larger-scale datasets would test the robustness of the claimed mechanism.

Load-bearing premise

The assumption that calibration error is caused by (and can be reduced by changing) the correlation between dropout-sampled models rather than by network architecture, optimization, or dataset properties.
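As background for the ensemble-learning lens behind this premise, a standard textbook identity shows why correlation limits what averaging can achieve. It is offered as context, not as the paper's derivation, and it concerns the variance of an averaged prediction rather than calibration error itself. The symbols are assumptions introduced here: M sampled models f_m, each with variance σ² at input x, and a common pairwise correlation ρ.

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2
```

As M grows the second term vanishes, leaving a floor of ρσ² that only a lower correlation can reduce.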

What would settle it

An experiment in which structured dropout measurably lowers model correlation yet expected calibration error stays the same or rises.
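The "model correlation" side of such an experiment needs a concrete statistic. One simple candidate is the mean pairwise disagreement between the class predictions of models sampled with different dropout masks, sketched below. The function name and the (T, N, C) array layout are assumptions for this example, and the paper may use a different diversity measure.

```python
import numpy as np

def mean_pairwise_disagreement(sample_probs):
    """Average, over all pairs of sampled models, of the fraction of inputs
    on which the two models predict different classes.

    sample_probs: array of shape (T, N, C) with T >= 2 sampled models,
    N inputs, and C classes.
    """
    preds = np.asarray(sample_probs).argmax(axis=2)  # (T, N) hard predictions
    t = preds.shape[0]
    total, pairs = 0.0, 0
    for i in range(t):
        for j in range(i + 1, t):
            total += np.mean(preds[i] != preds[j])
            pairs += 1
    return total / pairs
```

A higher value means more diverse (less correlated) samples; the decisive experiment would report this number alongside expected calibration error for each dropout variant.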

Figures

Figures reproduced from arXiv: 1906.09551 by Adrian V. Dalca, Mert R. Sabuncu, Zhilu Zhang.

Figure 1: Reliability diagrams of predictions produced by different models. Models with structured …
Figure 2: Test accuracy (left) and ECE (right) against number of models for ensemble prediction …
Figure 3: Left: Test accuracy against number of training samples for models with different methods of dropout and Variation Ratios as the acquisition function on CIFAR-10. Right: Relative improvements in test accuracy over that of the first iteration with different methods of dropout. MC dropout yields the least improvements of all the methods.
Figure 4: Plots of test time NLL (Left) and accuracy (Right) against dropout rate for models trained with different types of dropout on the SVHN, CIFAR-10 and CIFAR-100 datasets. Models trained with structured dropout can achieve better NLL performance, particularly for moderate values of the dropout rate. DropLayer is the least sensitive to the choice of dropout rate with respect to NLL. Interestingly, the NLL dras…
Figure 5: Left: Test accuracy against number of training samples for models with different methods of dropout and Max Entropy (Above) / BALD (Below) as the acquisition function on CIFAR-10. Right: Relative improvements in test accuracy over that of the first iteration with different methods of dropout. Similar to results obtained with Variation Ratios, MC dropout yields the least improvements of all the methods.
Original abstract

In classification applications, we often want probabilistic predictions to reflect confidence or uncertainty. Dropout, a commonly used training technique, has recently been linked to Bayesian inference, yielding an efficient way to quantify uncertainty in neural network models. However, as previously demonstrated, confidence estimates computed with a naive implementation of dropout can be poorly calibrated, particularly when using convolutional networks. In this paper, through the lens of ensemble learning, we associate calibration error with the correlation between the models sampled with dropout. Motivated by this, we explore the use of structured dropout to promote model diversity and improve confidence calibration. We use the SVHN, CIFAR-10 and CIFAR-100 datasets to empirically compare model diversity and confidence errors obtained using various dropout techniques. We also show the merit of structured dropout in a Bayesian active learning application.
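The active learning application ranks unlabeled inputs by an uncertainty score computed from dropout samples. The figure captions name three acquisition functions: Variation Ratios, Max Entropy, and BALD. The sketch below gives common formulations of each, assuming an array of shape (T, N, C) holding softmax outputs from T sampled masks. The function names are hypothetical, and Variation Ratios is written in its vote-frequency form; another common variant uses one minus the maximum mean probability, and the page does not say which the paper uses.

```python
import numpy as np

def _entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def max_entropy(sample_probs):
    """Entropy of the mean predictive distribution, per input."""
    return _entropy(sample_probs.mean(axis=0))

def bald(sample_probs):
    """Predictive entropy minus the mean entropy of the individual samples,
    i.e. the mutual information between prediction and sampled model."""
    return max_entropy(sample_probs) - _entropy(sample_probs).mean(axis=0)

def variation_ratios(sample_probs):
    """One minus the fraction of sampled models voting for the modal class."""
    t, n, c = sample_probs.shape
    votes = sample_probs.argmax(axis=2)                    # (T, N)
    counts = np.stack([(votes == k).sum(axis=0) for k in range(c)], axis=1)
    return 1.0 - counts.max(axis=1) / t
```

In each acquisition round, the inputs with the highest scores would be sent for labeling; when the sampled models agree on everything, BALD and Variation Ratios are zero for every input and cannot rank the pool.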

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that confidence calibration error for dropout-based CNNs arises from correlation among the models sampled by dropout. Motivated by an ensemble-learning perspective, it proposes structured dropout to increase model diversity and thereby reduce calibration error. This is evaluated empirically by comparing diversity metrics and expected calibration error (ECE) across dropout variants on SVHN, CIFAR-10 and CIFAR-100, with an additional demonstration in Bayesian active learning.

Significance. If the claimed causal link between reduced model correlation and improved calibration is substantiated, the work supplies a low-cost modification to a standard regularization technique that directly improves uncertainty quantification in deep classifiers, with immediate relevance to active learning and safety-critical applications.

major comments (1)
  1. [Section 4 (Empirical Evaluation)] The central claim requires that calibration error is driven by (and improved by reducing) correlation among dropout-sampled models. The experiments compare diversity metrics and ECE across dropout variants on SVHN/CIFAR but do not include controls that hold individual-model accuracy, variance, or effective regularization fixed while varying only pairwise correlation. Without such isolation, the observed ECE reductions could arise from changes in per-model sharpness or optimization dynamics rather than ensemble diversity.
minor comments (2)
  1. [Section 3] Notation for the structured dropout masks (e.g., block size, channel vs. spatial structure) is introduced only informally; an explicit definition or pseudocode would aid reproducibility.
  2. [Figure 3] Figure captions for the diversity-vs-ECE scatter plots should state the number of Monte-Carlo samples used to estimate each point.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive criticism. We respond to the single major comment below.

Point-by-point responses
  1. Referee: [Section 4 (Empirical Evaluation)] The central claim requires that calibration error is driven by (and improved by reducing) correlation among dropout-sampled models. The experiments compare diversity metrics and ECE across dropout variants on SVHN/CIFAR but do not include controls that hold individual-model accuracy, variance, or effective regularization fixed while varying only pairwise correlation. Without such isolation, the observed ECE reductions could arise from changes in per-model sharpness or optimization dynamics rather than ensemble diversity.

    Authors: We agree that the experiments do not isolate pairwise correlation while holding per-model accuracy, variance, or regularization strength fixed, and that this leaves open the possibility that ECE changes arise from other mechanisms. The variants compared (standard dropout, spatial dropout, channel dropout, etc.) were chosen because they alter mask structure in ways expected to affect correlation; the consistent alignment between measured diversity and ECE across SVHN, CIFAR-10, and CIFAR-100 supports the motivating hypothesis, but the design remains correlational rather than controlled. In the revised manuscript we will (i) explicitly acknowledge this limitation in Section 4, (ii) report per-model accuracy and sharpness statistics for each method so readers can assess confounding, and (iii) add a short discussion of the difficulty of constructing a perfect isolation experiment within the dropout framework. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical associations and comparisons

Full rationale

The paper presents an empirical investigation that associates calibration error with dropout-induced model correlation and evaluates structured dropout variants on SVHN/CIFAR datasets. No equations, fitted parameters, or derivations are shown that reduce by construction to inputs, self-citations, or ansatzes. The central premise is framed as a motivation for experiments rather than a load-bearing theorem or self-referential definition. This is a standard self-contained empirical study with independent experimental content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that calibration error can be attributed to model correlation in dropout ensembles and that structured dropout will reduce that correlation. No free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Calibration error is associated with the correlation between the models sampled with dropout.
    Invoked when the authors motivate structured dropout from the ensemble-learning perspective.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Algorithm and Hardware Co-Design for Efficient Complex-Valued Uncertainty Estimation

    cs.AR 2026-04 unverdicted novelty 7.0

    Proposes dropout-based BayesCVNNs with automated configuration search and FPGA accelerators that deliver 4.5x–13x speedups over GPUs while enabling uncertainty estimation for complex-valued neural networks.

  2. VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    VOLTA, consisting of a deep encoder with learnable prototypes plus cross-entropy and post-hoc temperature scaling, matches or exceeds ten UQ baselines in accuracy, achieves lower expected calibration error, and perfor...
