arxiv: 2411.05183 · v4 · submitted 2024-11-07 · 💻 cs.CV · cs.LG

Why CNN Features Are not Gaussian: A Statistical Anatomy of Deep Representations

Pith reviewed 2026-05-23 17:02 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords CNN features · statistical distribution · Weibull distribution · tail dependence · deep representations · non-Gaussian · copula modeling · Matthew process

The pith

CNN feature activations deviate substantially from Gaussian and follow long-tailed Weibull distributions instead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the common assumption that internal activations in convolutional neural networks follow Gaussian distributions. Systematic measurements across multiple architectures and datasets show that the activations instead exhibit long tails best matched by Weibull and related families. Tail length grows with network depth while upper-tail dependence appears between pairs of features. These patterns contradict expectations from the central limit theorem and point instead to a process that concentrates semantic information in the extremes. The results indicate that networks built this way reduce noise effectively but handle outliers poorly, so density models should use long-tailed priors with upper-tail dependence rather than Gaussians.

Core claim

Deep convolutional neural networks produce internal feature activations whose distributions are substantially non-Gaussian and instead follow long-tailed families such as the Weibull. A new Discretized Characteristic Function Copula method reveals increasing tail length with depth and the emergence of upper-tail dependence between feature pairs. These patterns indicate a Matthew process that concentrates semantic signal in the tails, making the networks effective at noise reduction but less so at handling outliers.

What carries the argument

The Discretized Characteristic Function Copula (DCF-Copula) method, which models multivariate feature dependencies and exposes upper-tail dependence not captured by Gaussian assumptions.

If this is right

  • CNNs reduce noise effectively yet perform poorly on outlier removal tasks.
  • Long-tailed upper-tail-dependent priors should replace Gaussian priors when modeling deep feature densities.
  • Tail length increases with network depth.
  • Upper-tail dependence emerges between feature pairs as depth grows.
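
The upper-tail dependence named in the bullets above can be made concrete with the standard empirical coefficient P(V > q | U > q), computed on rank-transformed data. The sketch below is a generic estimator and not the paper's DCF-Copula; the function names, the threshold q = 0.95, and both synthetic feature pairs are assumptions chosen for illustration.

```python
import random

def pseudo_observations(xs):
    """Rank-transform a sample to (0, 1) via rank / (n + 1)."""
    n = len(xs)
    order = sorted(range(n), key=xs.__getitem__)
    u = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (n + 1)
    return u

def upper_tail_dependence(x, y, q=0.95):
    """Empirical estimate of P(V > q | U > q) on the rank scale."""
    u, v = pseudo_observations(x), pseudo_observations(y)
    exceed_u = sum(1 for a in u if a > q)
    exceed_both = sum(1 for a, b in zip(u, v) if a > q and b > q)
    return exceed_both / exceed_u

rng = random.Random(0)
n = 20_000

# Independent pair: the estimate should sit near 1 - q.
x_ind = [rng.gauss(0, 1) for _ in range(n)]
y_ind = [rng.gauss(0, 1) for _ in range(n)]
lam_independent = upper_tail_dependence(x_ind, y_ind)

# Hypothetical pair driven by a shared heavy-tailed factor plus private noise
# (an illustrative assumption, not data from the paper).
shared = [rng.paretovariate(1.5) for _ in range(n)]
x_dep = [s + rng.gauss(0, 1) for s in shared]
y_dep = [s + rng.gauss(0, 1) for s in shared]
lam_shared = upper_tail_dependence(x_dep, y_dep)

print(f"independent pair:   {lam_independent:.3f}")
print(f"shared-factor pair: {lam_shared:.3f}")
```

For an independent pair the estimate stays close to 1 - q, while a pair whose large values come from a common heavy-tailed source keeps a high estimate as q approaches 1. A Gaussian copula has zero upper-tail dependence in the limit, which is why a Gaussian density model would miss this pattern.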

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar non-Gaussian tail behavior may appear in transformer or other non-convolutional deep networks.
  • Feature-based density estimation or generative models could gain accuracy by adopting these tail-dependent priors.
  • Outlier-sensitive applications that rely on deep features may require revised statistical assumptions.

Load-bearing premise

The empirical fits to Weibull and related families on the chosen architectures and datasets generalize beyond the tested cases, and the observed tail behavior is not an artifact of the specific activation functions or normalization layers used.

What would settle it

Observing that feature activations across layers in a new deep CNN fit a Gaussian distribution closely on multiple standard datasets would contradict the central claim.

Figures

Figures reproduced from arXiv: 2411.05183 by David Chapman, Parniyan Farvardin.

Figure 1: Comparison of orthogonal basis functions over the uniform interval ( …
Figure 2: Illustration of ResNet-18 (left) and VGG-19 (right) deep feature layers selected for …
Figure 3: Percent of nonzero features per layer. We now quantitatively compare the goodness of fit of five parametric distributions to the marginals of the non-zero features. The distributions that we compare are the uniform distribution, the Gaussian distribution, the gamma distribution, and the Weibull distribution. The optimal parameters of these distributions are determined using the method of stochastic hill c…
Figure 4: Histogram of marginal density for pre-trained ResNet-18 on Imagenette2 for features …
Figure 5: Quantitative goodness-of-fit of five standard distributions to the feature marginals for …
Figure 6: Tail parameter analysis for CIFAR-10, CIFAR-100, Imagenette2, and MNIST, across …
Figure 7: Empirical optimal Weibull θ tail parameters per layer (solid) versus theoretical estimates (dashed). We also observe that VGG-19 has significantly longer tails than comparable ResNet models for the deep intermediate layers. Across all datasets, all of the deep intermediate layers for VGG-19 exhibit long tails. For ResNet, only the deepest intermediate layer is long-tailed, with the exception of MNIST ResNe…
Figure 8: Select copula interdependence for pairwise features for 5 layers of ResNet-18 over Ima…
Figure 9: Select copula interdependence for pairwise features for 5 layers of ResNet-18 over Ima…
Original abstract

Deep convolutional neural networks (CNNs) are commonly analyzed through geometric and linear-algebraic perspectives, yet the statistical distribution of their internal feature activations remains poorly understood. In many applications, deep features are implicitly treated as Gaussian when modeling densities. In this work, we empirically examine this assumption and show that it does not accurately describe the distribution of CNN feature activations. Through a systematic study across multiple architectures and datasets, we find that the feature activations deviate substantially from Gaussian and are better characterized by Weibull and related long-tailed distributions. We further introduce a novel Discretized Characteristic Function Copula (DCF-Copula) method to model multivariate feature dependencies. We find that tail length increases with network depth and that upper-tail dependence emerges between feature pairs. These statistical findings are not consistent with the Central Limit Theorem, and are instead indicative of a Matthew process that progressively concentrates semantic signal within the tails. These statistical findings suggest that CNNs are excellent at noise reduction, yet poor at outlier removal tasks. We recommend the use of long-tailed upper-tail-dependent priors as opposed to Gaussian priors for accurately modeling CNN deep feature density. Code available at https://github.com/dchapman-prof/DCF-Copula

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper empirically analyzes the statistical distributions of internal feature activations in CNNs across architectures and datasets. It claims these activations deviate substantially from Gaussianity and are better characterized by Weibull and other long-tailed families. The work introduces the Discretized Characteristic Function Copula (DCF-Copula) to capture multivariate dependencies, reports increasing tail length with depth and emerging upper-tail dependence between features, interprets the results as inconsistent with the Central Limit Theorem and instead indicative of a Matthew process that concentrates semantic signal in tails, and recommends long-tailed upper-tail-dependent priors over Gaussian ones for feature density modeling. Reproducible code is provided.

Significance. If the empirical distribution findings and tail-dependence results hold under scrutiny, the work supplies a useful statistical characterization of deep representations that questions the routine Gaussian assumption in density estimation and feature modeling tasks. The DCF-Copula is presented as a novel modeling tool for tail dependencies. Explicit code release supports reproducibility and verification of the reported fits.

major comments (2)
  1. [Abstract / CLT discussion] The interpretive claim (Abstract and the section contrasting findings with the CLT) that the long-tailed behavior 'is not consistent with the Central Limit Theorem' is load-bearing for the Matthew-process interpretation, yet the manuscript provides no derivation or simulation establishing that CLT conditions (independent or weakly dependent summands with finite variance) would be expected to produce Gaussian activations given the actual generative process: convolutions over spatially/channel-dependent inputs, pointwise nonlinearities (ReLU), and normalization layers.
  2. [Empirical methodology] § on empirical distribution fitting: the reported superiority of Weibull and related families rests on distribution fitting whose details (per-layer and per-feature sample sizes, goodness-of-fit tests employed, handling of zero activations from ReLU, and multiple-testing correction across channels and layers) are not reported, undermining assessment of whether the tail-length and dependence claims are robust or artifacts of the chosen architectures/normalizations.
minor comments (1)
  1. [Methods] The formal definition and discretization procedure for the DCF-Copula could be stated more explicitly with pseudocode or equations to aid implementation.
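
Major comment 1 turns on what the Central Limit Theorem actually predicts. A minimal sketch of the contrast, under loud assumptions: the first case sums independent finite-variance terms, which is the setting the CLT covers; the second is a hypothetical scalar chain in which every layer multiplies by a fresh Gaussian weight and applies ReLU. The chain is not the authors' model and does not behave like a trained CNN (weights are redrawn per sample); it only illustrates that a multiplicative process falls outside the additive CLT setting. The function names and all sizes are illustrative.

```python
import random

def excess_kurtosis(xs):
    """Sample excess kurtosis; approximately 0 for Gaussian data."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0

rng = random.Random(0)

# Case 1: CLT setting. Each value is a sum of 64 independent uniform terms.
clt_sums = [sum(rng.uniform(-1, 1) for _ in range(64)) for _ in range(20_000)]

# Case 2: hypothetical toy chain (an assumption, not the paper's process).
# A scalar passes through `depth` layers; each layer multiplies by a fresh
# Gaussian weight and applies ReLU, so the value is either zeroed or rescaled.
def toy_relu_chain(depth):
    x = abs(rng.gauss(0, 1))
    for _ in range(depth):
        x = max(0.0, rng.gauss(0, 1) * x)
    return x

chain = [toy_relu_chain(depth=3) for _ in range(80_000)]
nonzero = [x for x in chain if x > 0.0]

kurt_sum = excess_kurtosis(clt_sums)
kurt_chain = excess_kurtosis(nonzero)
nonzero_fraction = len(nonzero) / len(chain)

print(f"sum of independent terms: excess kurtosis {kurt_sum:.2f}")
print(f"toy ReLU chain (nonzero): excess kurtosis {kurt_chain:.2f}")
print(f"toy ReLU chain: fraction nonzero {nonzero_fraction:.3f}")
```

The sum lands near zero excess kurtosis, as the CLT predicts. The surviving values of the chain are products rather than sums, so any Gaussian limit would apply to their logarithms, and the values themselves are strongly long-tailed. Whether real CNN activations follow a comparable mechanism is exactly what the referee asks the authors to establish.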

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and robustness of our empirical findings on CNN feature distributions. We respond to each major comment below and will revise the manuscript to incorporate additional details and discussion as outlined.

point-by-point responses
  1. Referee: [Abstract / CLT discussion] The interpretive claim (Abstract and the section contrasting findings with the CLT) that the long-tailed behavior 'is not consistent with the Central Limit Theorem' is load-bearing for the Matthew-process interpretation, yet the manuscript provides no derivation or simulation establishing that CLT conditions (independent or weakly dependent summands with finite variance) would be expected to produce Gaussian activations given the actual generative process: convolutions over spatially/channel-dependent inputs, pointwise nonlinearities (ReLU), and normalization layers.

    Authors: We agree that the manuscript would benefit from a more explicit justification of why the CLT does not apply here. Our core claim remains empirical: the observed activations exhibit long tails inconsistent with Gaussianity, which we contrast with the CLT's typical prediction under standard assumptions of independent or weakly dependent summands with finite variance. The CNN generative process (convolutions inducing spatial/channel dependence, ReLU introducing asymmetry and potential infinite moments, and normalizations) violates these conditions, supporting the Matthew-process reading. To strengthen this, the revised version will include a short discussion of the relevant CLT conditions alongside a minimal simulation contrasting summed independent finite-variance variables (yielding approximate Gaussianity) with a simplified ReLU-convolution process (reproducing heavy tails). This addition addresses the load-bearing nature of the claim without altering the empirical results. revision: yes

  2. Referee: [Empirical methodology] § on empirical distribution fitting: the reported superiority of Weibull and related families rests on distribution fitting whose details (per-layer and per-feature sample sizes, goodness-of-fit tests employed, handling of zero activations from ReLU, and multiple-testing correction across channels and layers) are not reported, undermining assessment of whether the tail-length and dependence claims are robust or artifacts of the chosen architectures/normalizations.

    Authors: We acknowledge that these methodological details were insufficiently reported and will expand the relevant section in revision. Per-feature sample sizes are determined by aggregating over spatial dimensions and batch size, yielding approximately 10^4–10^5 observations per channel (varying by layer depth and input resolution). Fitting used maximum-likelihood estimation for candidate distributions (Gaussian, Weibull, log-normal, etc.), with model selection based on AIC/BIC and visual Q-Q plot inspection focused on tails; Kolmogorov-Smirnov tests were applied for quantitative comparison where sample sizes permitted. ReLU-induced zeros were handled by separately modeling the point mass at zero and fitting the continuous positive support to the nonzero activations. No formal multiple-testing correction was applied, as the analysis emphasizes qualitative trends (increasing tail length and dependence with depth) across architectures rather than per-channel hypothesis tests. The revision will add an explicit subsection with these specifications, sample-size tables, and code references to allow independent verification of robustness. revision: yes
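
The fitting procedure described in the second response (point mass at zero handled separately, maximum-likelihood fits on the nonzero activations, comparison by AIC) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the synthetic channel, the 40% zero rate, the true shape of 0.8, and every function name are assumptions, and only the Weibull and Gaussian candidates are shown.

```python
import math
import random

def fit_weibull(xs):
    """MLE for a two-parameter Weibull on strictly positive data.

    Returns (shape, scale, log_likelihood).
    """
    n = len(xs)
    top = max(xs)
    ys = [x / top for x in xs]            # rescale to (0, 1] for stability
    logs = [math.log(y) for y in ys]
    mean_log = sum(logs) / n

    def score(k):
        # Profile score equation for the shape; increasing in k.
        pw = [y ** k for y in ys]
        return sum(p * l for p, l in zip(pw, logs)) / sum(pw) - 1.0 / k - mean_log

    lo, hi = 1e-2, 1e2
    for _ in range(60):                   # bisection on the shape
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = 0.5 * (lo + hi)
    scale = top * (sum(y ** k for y in ys) / n) ** (1.0 / k)
    sum_log_x = sum(math.log(x) for x in xs)
    # At the profile-MLE scale, sum((x / scale) ** k) equals n.
    loglik = n * math.log(k) - n * k * math.log(scale) + (k - 1) * sum_log_x - n
    return k, scale, loglik

def fit_gaussian(xs):
    """MLE for a Gaussian. Returns (mean, std, log_likelihood)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    return mu, math.sqrt(var), loglik

def aic(loglik, n_params=2):
    return 2 * n_params - 2 * loglik

rng = random.Random(0)
# Synthetic stand-in for one channel (assumption): 40% exact zeros from ReLU,
# the remainder drawn from a Weibull with scale 1.0 and shape 0.8.
acts = [0.0 if rng.random() < 0.4 else rng.weibullvariate(1.0, 0.8)
        for _ in range(20_000)]

positive = [a for a in acts if a > 0.0]   # continuous part only
zero_mass = 1.0 - len(positive) / len(acts)

k_hat, scale_hat, ll_weibull = fit_weibull(positive)
_, _, ll_gauss = fit_gaussian(positive)
aic_weibull, aic_gauss = aic(ll_weibull), aic(ll_gauss)

print(f"zero mass {zero_mass:.3f}, shape {k_hat:.3f}, scale {scale_hat:.3f}")
print(f"AIC Weibull {aic_weibull:.1f} vs AIC Gaussian {aic_gauss:.1f}")
```

A lower AIC favors the Weibull on this synthetic channel by construction. On real activations the comparison would be repeated per channel and per layer, which is where the referee's questions about sample sizes and multiple testing apply.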

Circularity Check

0 steps flagged

No significant circularity; purely observational with independent modeling contribution

full rationale

The paper conducts an empirical statistical analysis of CNN feature activations across architectures and datasets, fitting distributions (Weibull etc.) and introducing the DCF-Copula method for dependencies. No derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. The interpretive contrast with CLT and reference to a Matthew process are post-hoc characterizations of observed data rather than load-bearing derivations. Self-citations are absent from the provided text, and the central claims rest on direct empirical measurements rather than fitted parameters renamed as predictions or ansatzes smuggled via prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard statistical assumptions about i.i.d. sampling of activations and on the choice of parametric families (Weibull, etc.) whose parameters are fitted to data; the new DCF-Copula is an invented modeling entity whose independent evidence is the empirical fit itself.

free parameters (1)
  • Weibull shape and scale per layer
    Fitted to empirical activation histograms; central to the claim that Weibull outperforms Gaussian.
axioms (1)
  • domain assumption: Activations within a layer are treated as i.i.d. samples from a common marginal distribution
    Invoked when fitting univariate distributions and when constructing the copula.
invented entities (1)
  • DCF-Copula (no independent evidence)
    purpose: Model multivariate upper-tail dependence among feature activations
    New construction introduced to capture observed tail dependence; independent evidence is the reported empirical improvement over Gaussian copulas.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    feature activations deviate substantially from Gaussian and are better characterized by Weibull and related long-tailed distributions... tail-length increases with network depth and that upper-tail dependence emerges between feature pairs... indicative of a Matthew process that progressively concentrates semantic signal within the tails

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks

    cs.LG 2026-05 unverdicted novelty 7.0

    CAWI replaces standard random initialization of input-to-hidden weights in randomized neural networks with samples drawn from a data-fitted copula that preserves observed feature dependencies, yielding consistent accu...

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    A class of bivariate distributions including the bivariate logistic

    Mir M Ali, NN Mikhail, and M Safiul Haq. A class of bivariate distributions including the bivariate logistic. Journal of multivariate analysis , 8(3):405–412, 1978

  2. [2]

    Towards understanding ensemble, knowledge distillation and self-distillation in deep learning

    Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In The Eleventh International Conference on Learning Representations, 2023

  3. [3]

    A characteristic function approach to deep implicit generative modeling

    Abdul Fatir Ansari, Jonathan Scarlett, and Harold Soh. A characteristic function approach to deep implicit generative modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , June 2020

  4. [4]

    Portfolio optimization through hybrid deep learning and genetic algorithms vine copula-garch-evt-cvar model

    Rihab Bedoui, Ramzi Benkraiem, Khaled Guesmi, and Islem Kedidi. Portfolio optimization through hybrid deep learning and genetic algorithms vine copula-garch-evt-cvar model. Tech- nological Forecasting and Social Change, 197:122887, 2023

  5. [5]

    Recent development in copula and its applications to the energy, forestry and environmental sciences

    M Ishaq Bhatti and Hung Quang Do. Recent development in copula and its applications to the energy, forestry and environmental sciences. International Journal of Hydrogen Energy , 44(36):19453–19473, 2019

  6. [6]

    Novelty detection and neural network validation

    Christopher M Bishop. Novelty detection and neural network validation. In ICANN’93: Proceedings of the International Conference on Artificial Neural Networks Amsterdam, The Netherlands 13–16 September 1993 3 , pages 789–794. Springer, 1993

  7. [7]

    Variational inference with continuously-indexed normalizing flows

    Anthony Caterini, Rob Cornish, Dino Sejdinovic, and Arnaud Doucet. Variational inference with continuously-indexed normalizing flows. In Uncertainty in Artificial Intelligence , pages 44–53. PMLR, 2021

  8. [8]

    Anomaly detection: A survey

    Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR) , 41(3):1–58, 2009

  9. [9]

    Under- standing and improving feature learning for out-of-distribution generalization

    Yongqiang Chen, Wei Huang, Kaiwen Zhou, Yatao Bian, Bo Han, and James Cheng. Under- standing and improving feature learning for out-of-distribution generalization. Advances in Neural Information Processing Systems, 36, 2024

  10. [10]

    Probabilistic circuits: A unifying framework for tractable probabilistic models, 2020

    YooJung Choi, Antonio Vergari, and Guy Van den Broec. Probabilistic circuits: A unifying framework for tractable probabilistic models, 2020

  11. [11]

    A model for association in bivariate life tables and its application in epidemi- ological studies of familial tendency in chronic disease incidence

    David G Clayton. A model for association in bivariate life tables and its application in epidemi- ological studies of familial tendency in chronic disease incidence. Biometrika, 65(1):141–151, 1978

  12. [12]

    Feature density estimation for out-of-distribution detection via normalizing flows

    Evan D Cook, Marc-Antoine Lavoie, and Steven L Waslander. Feature density estimation for out-of-distribution detection via normalizing flows. arXiv preprint arXiv:2402.06537 , 2024

  13. [13]

    Archimedean copula and contagion modeling in epidemiology

    Jacques Demongeot, Mohamad Ghassani, Mustapha Rachdi, Idir Ouassou, and Carla Taram- asco. Archimedean copula and contagion modeling in epidemiology. Networks and Heteroge- neous Media, 8(1):149–170, 2013

  14. [14]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 32

  15. [15]

    The mnist database of handwritten digit images for machine learning research

    Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012

  16. [16]

    Orthogonal gradient descent for continual learning

    Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In International conference on artificial intelligence and statistics , pages 3762–3773. PMLR, 2020

  17. [17]

    Does learning require memorization? a short tale about a long tail

    Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing , pages 954–959, 2020

  18. [18]

    What neural networks memorize and why: Discovering the long tail via influence estimation

    Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems , 33:2881–2891, 2020

  19. [19]

    The empirical characteristic function and its applications

    Andrey Feuerverger and Roman A Mureika. The empirical characteristic function and its applications. The annals of Statistics , pages 88–97, 1977

  20. [20]

    On the simultaneous associativity of f (x, y) and x+ y- f (x, y)

    Maurice J Frank. On the simultaneous associativity of f (x, y) and x+ y- f (x, y). Aequationes mathematicae, 19:194–226, 1979

  21. [21]

    A low effort approach to structured cnn design using pca

    Isha Garg, Priyadarshini Panda, and Kaushik Roy. A low effort approach to structured cnn design using pca. IEEE Access, 8:1347–1360, 2019

  22. [22]

    Integrating flexible normalization into mi- dlevel representations of deep convolutional neural networks.Neural computation, 31(11):2138– 2176, 2019

    Luis Gonzalo S´ anchez Giraldo and Odelia Schwartz. Integrating flexible normalization into mi- dlevel representations of deep convolutional neural networks.Neural computation, 31(11):2138– 2176, 2019

  23. [23]

    Generative adversarial networks.Communications of the ACM , 63(11):139–144, 2020

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM , 63(11):139–144, 2020

  24. [24]

    Out-of-distribution de- tection is not all you need

    Joris Gu´ erin, Kevin Delmas, Raul Ferreira, and J´ er´ emie Guiochet. Out-of-distribution de- tection is not all you need. In Proceedings of the AAAI conference on artificial intelligence , volume 37, pages 14829–14837, 2023

  25. [25]

    Bivariate exponential distributions

    Emil J Gumbel. Bivariate exponential distributions. Journal of the American Statistical Association, 55(292):698–707, 1960

  26. [26]

    Large sample properties of generalized method of moments estimators

    Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the econometric society , pages 1029–1054, 1982

  27. [27]

    A brief survey on semantic segmentation with deep learning

    Shijie Hao, Yuan Zhou, and Yanrong Guo. A brief survey on semantic segmentation with deep learning. Neurocomputing, 406:302–321, 2020

  28. [28]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  29. [29]

    What shapes feature representations? exploring datasets, architectures, and training

    Katherine Hermann and Andrew Lampinen. What shapes feature representations? exploring datasets, architectures, and training. Advances in Neural Information Processing Systems , 33:9995–10006, 2020. 33

  30. [30]

    Imagenette: A smaller subset of 10 easily classified classes from imagenet, March 2019

    Jeremy Howard. Imagenette: A smaller subset of 10 easily classified classes from imagenet, March 2019

  31. [31]

    Spatio-temporal wind speed prediction based on clayton copula function with deep learning fusion

    Yu Huang, Bingzhe Zhang, Huizhen Pang, Biao Wang, Kwang Y Lee, Jiale Xie, and Yupeng Jin. Spatio-temporal wind speed prediction based on clayton copula function with deep learning fusion. Renewable energy, 192:526–536, 2022

  32. [32]

    Detect- ing out-of-distribution data through in-distribution class prior

    Xue Jiang, Feng Liu, Zhen Fang, Hong Chen, Tongliang Liu, Feng Zheng, and Bo Han. Detect- ing out-of-distribution data through in-distribution class prior. In International Conference on Machine Learning, pages 15067–15088. PMLR, 2023

  33. [33]

    Multivariate extreme-value distributions with applications to environmental data

    Harry Joe. Multivariate extreme-value distributions with applications to environmental data. Canadian Journal of Statistics , 22(1):47–64, 1994

  34. [34]

    A review of copula methods for measuring uncertainty in finance and eco- nomics

    Jong-Min Kim. A review of copula methods for measuring uncertainty in finance and eco- nomics. Quantitative Bio-Science, 39(2):81–90, 2020

  35. [35]

    Auto-Encoding Variational Bayes

    Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 , 2013

  36. [36]

    Improved variational inference with inverse autoregressive flow.Advances in neural information processing systems, 29, 2016

    Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow.Advances in neural information processing systems, 29, 2016

  37. [37]

    Explaining distributed neural acti- vations via unsupervised learning

    Soheil Kolouri, Charles E Martin, and Heiko Hoffmann. Explaining distributed neural acti- vations via unsupervised learning. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 20–28, 2017

  38. [38]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  39. [39]

    Pytorch-cifar: optimized cnn aarchitectures for cifar10, 2017

    Liu Kuang. Pytorch-cifar: optimized cnn aarchitectures for cifar10, 2017

  40. [40]

    Perfect density models cannot guarantee anomaly detec- tion

    Charline Le Lan and Laurent Dinh. Perfect density models cannot guarantee anomaly detec- tion. Entropy, 23(12):1690, 2021

  41. [41]

    A simple unified framework for detecting out-of-distribution samples and adversarial attacks

    Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018

  42. [42]

    Mmd gan: Towards deeper understanding of moment matching network

    Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. Mmd gan: Towards deeper understanding of moment matching network. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  43. [43]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems , 34:9694–9705, 2021

  44. [44]

    Generative moment matching networks

    Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research , pages 1718–1727, Lille, France, 07–09 Jul 2015. PMLR. 34

  45. [45]

    Deep archimedean copulas

    Chun Kai Ling, Fei Fang, and J Zico Kolter. Deep archimedean copulas. Advances in Neural Information Processing Systems, 33:1535–1545, 2020

  46. [46]

    Unsupervised anomaly detection by robust density estimation

    Boyang Liu, Pang-Ning Tan, and Jiayu Zhou. Unsupervised anomaly detection by robust density estimation. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 4101–4108, 2022

  47. [47]

    Energy-based out-of-distribution detection

    Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 21464–21475. Curran Associates, Inc., 2020

  48. [48]

    Hybrid design of cnn and vision transformer: A review

    Hanhua Long. Hybrid design of cnn and vision transformer: A review. In Proceedings of the 2024 7th International Conference on Computer Information Science and Artificial Intel- ligence, pages 121–127, 2024

  49. [49]

    A method of moments embedding constraint and its application to semi-supervised learning

    Michael Majurski, Sumeet Menon, Parniyan Favardin, and David Chapman. A method of moments embedding constraint and its application to semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 7809–7818, 2024

  50. [50]

    13 financial applications of stable distributions

    J Huston McCulloch. 13 financial applications of stable distributions. Handbook of statistics, 14:393–425, 1996

  51. [51]

    Do Deep Generative Models Know What They Don't Know?

    Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshmi- narayanan. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018

  52. [52]

    An introduction to copulas

    Roger B Nelsen. An introduction to copulas. Springer, 2006

  53. [53]

    Learning deconvolution network for semantic segmentation

    Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015

  54. [54]

    Multivariate elliptically contoured stable distributions: theory and estimation

    John Nolan. Multivariate elliptically contoured stable distributions: theory and estimation. Computational Statistics, 28(5):2067–2089, 2013

  55. [55]

    Modeling and forecasting short-term power load with copula model and deep belief network

    Tinghui Ouyang, Yusen He, Huajin Li, Zhiyu Sun, and Stephen Baek. Modeling and forecasting short-term power load with copula model and deep belief network. IEEE Transactions on Emerging Topics in Computational Intelligence, 3(2):127–136, 2019

  56. [56]

    Complexity matters: Dynamics of feature learning in the presence of spurious correlations

    GuanWen Qiu, Da Kuang, and Surbhi Goel. Complexity matters: Dynamics of feature learning in the presence of spurious correlations. arXiv preprint arXiv:2403.03375, 2024

  57. [57]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  58. [58]

    Variational inference with normalizing flows

    Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015

  59. [59]

    Modeling the distribution of normal data in pre-trained deep features for anomaly detection

    Oliver Rippel, Patrick Mertens, and Dorit Merhof. Modeling the distribution of normal data in pre-trained deep features for anomaly detection. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 6726–6733. IEEE, 2021

  60. [60]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  61. [61]

    Gradient projection memory for continual learning

    Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021

  62. [62]

    Learning to share visual appearance for multiclass object detection

    Ruslan Salakhutdinov, Antonio Torralba, and Josh Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR 2011, pages 1481–1488, 2011

  63. [63]

    Opening the Black Box of Deep Neural Networks via Information

    Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017

  64. [64]

    Copula-based data augmentation on a deep learning architecture for cardiac sensor fusion

    Diogo Silva, Steffen Leonhardt, and Christoph Hoog Antink. Copula-based data augmentation on a deep learning architecture for cardiac sensor fusion. IEEE journal of biomedical and health informatics, 25(7):2521–2532, 2020

  65. [65]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  66. [66]

    Fonctions de répartition à n dimensions et leurs marges

    M Sklar. Fonctions de répartition à n dimensions et leurs marges. In Annales de l’ISUP, volume 8, pages 229–231, 1959

  67. [67]

    Feature distribution matching for federated domain generalization

    Yuwei Sun, Ng Chong, and Hideya Ochiai. Feature distribution matching for federated domain generalization. In Asian Conference on Machine Learning, pages 942–957. PMLR, 2023

  68. [68]

    Understanding priors in bayesian neural networks at the unit level

    Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, and Julyan Arbel. Understanding priors in bayesian neural networks at the unit level. In International Conference on Machine Learning, pages 6458–6467. PMLR, 2019

  69. [69]

    A survey on video diffusion models

    Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models. ACM Computing Surveys, 2023

  70. [70]

    Diffusion models: A comprehensive survey of methods and applications

    Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023

  71. [71]

    Empirical characteristic function estimation and its applications

    Jun Yu. Empirical characteristic function estimation and its applications. Econometric reviews, 23(2):93–123, 2004

  72. [72]

    Characteristic circuits

    Zhongjie Yu, Martin Trapp, and Kristian Kersting. Characteristic circuits. Advances in Neural Information Processing Systems, 36:34074–34086, 2023

  73. [73]

    Feature extraction and image retrieval based on alexnet

    Zheng-Wu Yuan and Jun Zhang. Feature extraction and image retrieval based on alexnet. In Eighth International Conference on Digital Image Processing (ICDIP 2016), volume 10033, pages 65–69. SPIE, 2016

  74. [74]

    Mathematical functions and their approximations

    Yudell L Luke. Mathematical functions and their approximations. Academic Press, New York, 1975

  75. [75]

    A systematic review on long-tailed learning

    Chongsheng Zhang, George Almpanidis, Gaojuan Fan, Binquan Deng, Yanbo Zhang, Ji Liu, Aouaidjia Kamel, Paolo Soda, and João Gama. A systematic review on long-tailed learning. IEEE Transactions on Neural Networks and Learning Systems, 2025

  76. [76]

    Understanding failures in out-of-distribution detection with deep generative models

    Lily Zhang, Mark Goldstein, and Rajesh Ranganath. Understanding failures in out-of-distribution detection with deep generative models. In International Conference on Machine Learning, pages 12427–12436. PMLR, 2021

  77. [77]

    Interpretable convolutional neural networks

    Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8827–8836, 2018

  78. [78]

    Capturing long-tail distributions of object subcategories

    Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan. Capturing long-tail distributions of object subcategories. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 915–922, 2014

  79. [79]

    Boosting out-of-distribution detection with typical features

    Yao Zhu, YueFeng Chen, Chuanlong Xie, Xiaodan Li, Rong Zhang, Hui Xue, Xiang Tian, Bolun Zheng, and Yaowu Chen. Boosting out-of-distribution detection with typical features. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 20758–20769. Curran Associates, In...

  80. [80]

    This value measures how well the trained parametric model explains the test histogram of filter d

    Compute the KL-divergence for the nonzero samples of each filter d within the target layer of D filters. We denote this KL-divergence as KL_d. This value measures how well the trained parametric model explains the test histogram of filter d

Showing first 80 references.