
arxiv: 1907.02711 · v1 · pith:IJVUI7F2new · submitted 2019-07-05 · 💻 cs.CV · cs.LG

Prior Activation Distribution (PAD): A Versatile Representation to Utilize DNN Hidden Units

Pith reviewed 2026-05-25 02:41 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords deep neural networks · hidden layer activations · prior activation distribution · uncertainty estimation · out-of-distribution detection · activation patterns · classification tasks

The pith

Hidden layer activations in deep neural networks exhibit class-specific distributional properties usable for uncertainty estimation and out-of-distribution detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the Prior Activation Distribution (PAD), which captures the typical activation patterns of hidden-layer units in DNNs used for classification. The authors show that the combined activations of a hidden layer have class-specific distributional properties, and they define statistical measures of how far a test sample's activations deviate from these distributions. According to the abstract, these PAD measures yield fine-grained uncertainty estimates, inference accuracy competitive with alternatives that execute the full pipeline, and reliable isolation of out-of-distribution samples, all independent of training technique. A sympathetic reader would care because PAD reuses a trained model's internal representations for uncertainty estimation and anomaly detection without additional model training or execution of the full pipeline.

Core claim

The paper claims that the combined neural activations of a hidden layer have class-specific distributional properties. It defines multiple statistical measures to compute how far a test sample's activations deviate from such distributions. Using benchmark datasets, it demonstrates PAD-based measures for uncertainty estimates, competitive inferencing accuracy, and out-of-distribution isolation, independent of any training technique.

What carries the argument

Prior Activation Distribution (PAD), a representation of typical hidden-layer activation patterns that supports statistical deviation measures for test samples.
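As a rough illustration of what such a representation could look like, the sketch below summarizes each class by the per-unit mean and standard deviation of its hidden-layer activations and scores a test sample by its mean absolute z-score under each class. This is an assumption made for illustration only: the summary above does not state which distributional form or deviation measures the paper uses, and the names `fit_pad` and `deviation`, as well as the synthetic activations, are hypothetical.

```python
import numpy as np

def fit_pad(activations, labels, eps=1e-6):
    """Summarize hidden-layer activations per class as (mean, std) per unit.

    Assumption: a per-unit Gaussian-style summary stands in for the prior;
    the paper's actual estimator is not given in this summary.
    """
    pad = {}
    for c in np.unique(labels):
        acts = activations[labels == c]
        pad[int(c)] = (acts.mean(axis=0), acts.std(axis=0) + eps)
    return pad

def deviation(pad, x):
    """Mean absolute z-score of activation vector x under each class summary."""
    return {c: float(np.mean(np.abs((x - mu) / sd))) for c, (mu, sd) in pad.items()}

# Synthetic stand-in for hidden-layer activations: two classes, 8 units each.
rng = np.random.default_rng(0)
acts = np.vstack([rng.normal(0.0, 1.0, (200, 8)), rng.normal(3.0, 1.0, (200, 8))])
labels = np.array([0] * 200 + [1] * 200)

pad = fit_pad(acts, labels)
scores = deviation(pad, np.full(8, 3.0))  # sample resembling class 1
```

In this toy setup the sample sits near the class 1 mean, so its deviation under class 1 is much smaller than under class 0. With a real network, `acts` would be the recorded outputs of one hidden layer on the training set.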

If this is right

  • PAD-based measures derive fine-grained uncertainty estimates for inferences.
  • They provide inferencing accuracy competitive with alternatives that require execution of the full pipeline.
  • They reliably isolate out-of-distribution test samples.
  • These capabilities hold independent of any training technique.
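The three uses listed above can be sketched from a single vector of per-class deviation scores. The rule below is a hypothetical illustration, not the paper's procedure: it predicts the class with the smallest deviation, reports the ratio of the smallest to the second-smallest deviation as an uncertainty score, and flags a sample as out-of-distribution when even the closest class is farther than a threshold. The function name `decide` and the threshold value are assumptions, and the scores are assumed to be non-negative with at least two classes.

```python
import numpy as np

def decide(dev_by_class, ood_threshold):
    """Turn per-class deviation scores into (prediction, uncertainty, is_ood).

    Assumes non-negative deviations and at least two classes.
    """
    classes = sorted(dev_by_class)
    devs = np.array([dev_by_class[c] for c in classes], dtype=float)
    order = np.argsort(devs)
    best, runner_up = devs[order[0]], devs[order[1]]
    # Ratio in [0, 1]: close to 1 when the two nearest classes are almost tied.
    uncertainty = float(best / runner_up) if runner_up > 0 else 1.0
    # Out-of-distribution: the sample is far from every class summary.
    is_ood = bool(best > ood_threshold)
    return classes[order[0]], uncertainty, is_ood

confident = decide({0: 0.5, 1: 3.0, 2: 4.0}, ood_threshold=2.0)
far_from_all = decide({0: 5.0, 1: 5.5, 2: 6.0}, ood_threshold=2.0)
```

Here `confident` is assigned class 0 with a low uncertainty ratio and is not flagged, while `far_from_all` is flagged because its smallest deviation exceeds the threshold. In practice the threshold would have to be calibrated on held-out in-distribution data.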

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • PAD deviation could support early or partial network evaluation for faster decisions in constrained environments.
  • The approach might extend to non-classification tasks such as regression by adapting the distributional measures.
  • Integrating PAD with ensemble methods could further refine uncertainty without extra training passes.

Load-bearing premise

The combined activations across hidden layer units exhibit class-specific distributional properties that can be reliably captured by statistical deviation measures from a prior distribution.

What would settle it

A demonstration that PAD deviation scores show no correlation with actual classification errors or fail to separate in-distribution from out-of-distribution samples on datasets such as MNIST or CIFAR10 would disprove the utility claims.
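One way to run the separation half of this check is to compute the AUROC between deviation scores of in-distribution and out-of-distribution samples. The sketch below uses the standard pairwise (Mann-Whitney) formulation in NumPy; the function name `auroc` and the example scores are hypothetical, and no results from the paper are reproduced here.

```python
import numpy as np

def auroc(in_scores, ood_scores):
    """Fraction of (OOD, in-distribution) pairs where the OOD sample has the
    higher deviation score, counting ties as half."""
    in_scores = np.asarray(in_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    greater = (ood_scores[:, None] > in_scores[None, :]).sum()
    ties = (ood_scores[:, None] == in_scores[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(in_scores) * len(ood_scores)))

# Made-up deviation scores, for illustration only.
separated = auroc([0.1, 0.2, 0.3], [0.9, 1.0])
overlapping = auroc([1.0, 2.0], [1.0, 2.0])
```

A value near 1.0 means the deviation scores separate the two populations, while a value near 0.5, as in the `overlapping` case, would be the failure outcome described above. The pairwise comparison is quadratic in memory, so a rank-based implementation would be preferable for large test sets.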

Figures

Figures reproduced from arXiv: 1907.02711 by Archan Misra, Lakmal Meegahapola, Lance Kaplan, Vengateswaran Subramaniam.

Figure 1: Examples for activation distributions of hidden-units of a …
Figure 4: Example from Modified-MNIST dataset
Figure 5: Softmax (MA1 model): Rotational MNIST
Figure 7: MNIST (MA3): Coverage vs. Accuracy
Original abstract

In this paper, we introduce the concept of Prior Activation Distribution (PAD) as a versatile and general technique to capture the typical activation patterns of hidden layer units of a Deep Neural Network used for classification tasks. We show that the combined neural activations of such a hidden layer have class-specific distributional properties, and then define multiple statistical measures to compute how far a test sample's activations deviate from such distributions. Using a variety of benchmark datasets (including MNIST, CIFAR10, Fashion-MNIST & notMNIST), we show how such PAD-based measures can be used, independent of any training technique, to (a) derive fine-grained uncertainty estimates for inferences; (b) provide inferencing accuracy competitive with alternatives that require execution of the full pipeline, and (c) reliably isolate out-of-distribution test samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Prior Activation Distribution (PAD) as a general technique to capture typical activation patterns of hidden layer units in DNNs for classification. It claims that the combined activations of a hidden layer exhibit class-specific distributional properties, defines statistical measures of deviation from per-class priors, and shows that these measures (independent of training method) can derive fine-grained uncertainty estimates, yield inference accuracy competitive with full-pipeline methods, and isolate out-of-distribution samples. Experiments are reported on MNIST, CIFAR10, Fashion-MNIST and notMNIST.

Significance. If the central claims hold with proper quantification, PAD would supply a training-independent, post-hoc representation for uncertainty and OOD tasks that could be applied to existing models. The multi-benchmark evaluation is a positive element. However, the absence of any reported separation metrics, baseline comparisons, or error bars in the provided information makes the practical significance difficult to gauge.

major comments (2)
  1. [Abstract] Abstract: the load-bearing premise that 'the combined neural activations of such a hidden layer have class-specific distributional properties' is asserted without any quantitative support (pairwise distances, classification accuracy of a model using only the deviation statistics, or comparison to softmax entropy). This directly undermines the three downstream claims (a)–(c).
  2. [Abstract] Abstract: no description is given of how the per-class prior is estimated, which statistical deviation measures are used, or any controls for layer choice and measure selection. Without these, it is impossible to assess whether the measures can overcome high inter-class overlap in activation space or whether results are post-hoc.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. The two major points both concern the abstract; we agree that it can be strengthened for self-containment and will revise it. The full manuscript already contains the requested methodological details and empirical demonstrations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the load-bearing premise that 'the combined neural activations of such a hidden layer have class-specific distributional properties' is asserted without any quantitative support (pairwise distances, classification accuracy of a model using only the deviation statistics, or comparison to softmax entropy). This directly undermines the three downstream claims (a)–(c).

    Authors: The manuscript demonstrates the class-specific distributional properties empirically via the three downstream tasks (uncertainty, competitive inference accuracy, and OOD isolation) on four image benchmarks. We acknowledge that the abstract itself supplies no direct quantitative support such as pairwise distances or a standalone classifier on deviation statistics. We will revise the abstract to include a concise statement of the supporting experimental outcomes. revision: yes

  2. Referee: [Abstract] Abstract: no description is given of how the per-class prior is estimated, which statistical deviation measures are used, or any controls for layer choice and measure selection. Without these, it is impossible to assess whether the measures can overcome high inter-class overlap in activation space or whether results are post-hoc.

    Authors: The methods and experimental sections of the manuscript specify the per-class prior estimation procedure, the statistical deviation measures employed, and the layer/measure selection protocol. We agree the abstract should be self-contained on these points and will add a brief description of the estimation and measures used. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained

full rationale

The provided abstract and description introduce PAD as a new representation, assert class-specific distributional properties of combined hidden activations, and define statistical deviation measures without any equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. Claims rest on empirical evaluation across benchmark datasets rather than reducing to inputs by construction. No steps match the enumerated circularity patterns, so the derivation chain has no detectable circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that hidden layer activations possess class-specific distributional properties; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Combined neural activations of a hidden layer have class-specific distributional properties.
    Directly stated in abstract as the foundation for defining PAD and deviation measures.

