
arxiv: 2605.20534 · v1 · pith:I7OWZ4ORnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI · stat.ML

Axiomatizing Neural Networks via Pursuit of Subspaces

Pith reviewed 2026-05-21 06:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords neural networks · axiomatic framework · subspace pursuit · deep learning theory · geometric postulates · representation learning · generalization

The pith

Neural networks operate according to geometric postulates about pursuing subspaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the Pursuit of Subspaces hypothesis as an axiomatic framework that models neural network behavior through a set of geometric postulates. It aims to explain representation, computation, and generalization in both shallow and deep networks as consequences of these rules, much like axioms clarify the properties of classical geometry. A reader would care if this holds because it could replace black-box views with a principled geometric account that addresses why networks succeed and how they generalize. The approach unifies explanations across architectures by deriving observable behaviors directly from the postulates.

Core claim

The PoS axioms together with their derived consequences provide a unified perspective on representation, computation, and generalization in both shallow and deep architectures and yield geometric explanations for fundamental questions in deep learning.

What carries the argument

The Pursuit of Subspaces (PoS) hypothesis, a collection of geometric postulates that treat network dynamics as the systematic pursuit of subspaces.

If this is right

  • The framework supplies geometric accounts of how representations form in network layers.
  • It explains the effects of architectural choices such as depth and width through subspace mechanisms.
  • Generalization behavior follows as a direct consequence of the geometric postulates rather than separate statistical arguments.
  • Both shallow linear networks and deep nonlinear ones fall under the same set of axioms and derived results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the axioms prove faithful, designers could use them to derive new architectures instead of relying on empirical search.
  • The approach may connect to existing geometric ideas in machine learning such as manifold assumptions without requiring additional machinery.
  • A direct test would involve checking whether subspace-pursuit predictions match the internal activations of trained networks on simple datasets.
  • The same postulates might extend to explain certain behaviors in other parameterized models beyond standard neural networks.
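
The direct test mentioned in the third bullet can be sketched numerically. The example below is an illustration only, not a procedure from the paper: the function `residual_fraction`, the hand-picked orthonormal `basis`, and the synthetic Gaussian data standing in for layer activations are all assumptions. It measures the share of activation energy that falls outside a predicted subspace, and compares it against an isotropic control with no preferred subspace.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_fraction(activations, basis):
    """Share of activation energy left outside span(basis).

    `basis` is assumed to be orthonormal. A value near 0.0 means the
    activations lie almost entirely in the subspace; a value near 1.0
    means almost none of their energy does.
    """
    total = 0.0
    captured = 0.0
    for a in activations:
        total += dot(a, a)
        captured += sum(dot(a, q) ** 2 for q in basis)
    return 1.0 - captured / total


# Hypothetical predicted subspace: the plane of the first two axes in R^4.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

rng = random.Random(0)

# Synthetic stand-in for layer activations: points in the plane plus small noise.
on_plane = [
    [rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 0.01), rng.gauss(0, 0.01)]
    for _ in range(200)
]

# Control: isotropic activations with no preferred subspace.
isotropic = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]

fit_score = residual_fraction(on_plane, basis)
control_score = residual_fraction(isotropic, basis)
print("residual fraction, structured activations:", fit_score)
print("residual fraction, isotropic control:", control_score)
```

With real networks, `on_plane` would be replaced by recorded activations and `basis` by whatever subspace the postulates predict for that layer. A residual fraction well below the control would be consistent with the prediction; it would not by itself establish it.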

Load-bearing premise

Neural network behavior can be captured by a small set of geometric postulates whose consequences are both non-trivial and faithful to observed network dynamics.

What would settle it

A clear case of network training or generalization where the observed representations and performance cannot be derived from or predicted by the PoS axioms.

Figures

Figures reproduced from arXiv: 2605.20534 by Felix Rojas Casadiego, Marcel van Gerven, Mehmet Yamac, Mert Duman, Moncef Gabbouj, Serkan Kiranyaz, Ugur Akpinar.

Figure 1: An example manifold and local coordinates. Beyond these basic examples, new manifolds can be obtained by forming Cartesian products, which are known as product manifolds. If $M_1$ and $M_2$ are manifolds of dimensions $k_1$ and $k_2$, then their product $M = M_1 \times M_2$ is itself a manifold of dimension $k_1 + k_2$. The local charts of $M$ are given by combining charts from $M_1$ and $M_2$. A fundamental example is the n-torus, defi…
Figure 2: Transversal intersections. (a) Linear subspaces. (b) Curved submanifolds. Let $F : X \to Y$ be a smooth map between manifolds, and let $Z \subset Y$ be a smooth submanifold. We say that $F$ is transversal to $Z$, written $F \pitchfork Z$, if for every point $x \in X$ with $F(x) \in Z$, we have $\mathrm{Im}(dF_x) + T_{F(x)}Z = T_{F(x)}Y$ (3). If this condition holds, then the preimage $F^{-1}(Z)$ is a smooth submanifold of $X$. Moreover, the codimension of F −…
Figure 3: Nonlinear orthogonal projection onto a manifold. (a) …
Figure 4: Uniqueness vs. stability. (a) Null-space …
Figure 5: Sparsity models. (a) Conventional sparsity. (b) Group sparsity. The recovery guarantees for k-sparse signals require controlling the $k_{\max}$-order constant (equal to $2k$ in the classical setting [7]), denoted $\delta_{k_{\max}}(D)$. The intuition parallels the null-space analysis above: to avoid $Dx' = Dx''$, we require $D(x' - x'') \neq 0$ for all distinct k-sparse vectors, which in turn requires that $D$ not annihilate any no…
Figure 6: Autoencoder as a geometric coordinate model.
Figure 7: Classical vs. PoS-based views of neural net…
Figure 8: Compact vs. non-compact representations. Linear models learn a single global span, which …
Figure 9: Vanilla Networks vs. Skip Connections. When $M_D$ is a single smooth manifold, the operator $P^{\perp}$ extracts components orthogonal to the tangent space $T_{P(s)}M_D$, i.e., elements of the normal space $N_{P(s)}M_D$. In the special case where $M_D$ is a flat k-dimensional subspace, this reduces to the classical null-space projection onto the orthogonal complement $\mathrm{span}(D)^{\perp}$, so that $P(s) = s - P^{\perp}(s)$ is simply the standard o…
Figure 10: Projection on sub-manifold $M_{D_j}$ can be realized via its own encoder-decoder or via isometric mapping from $M_{D_j}$ to $M_{D_i}$. Remark 1 (Isometry Invariance of Nonlinear Orthogonal Projection). (See [19].) Let $M$ be a nonempty subset of the Euclidean space $\mathbb{R}^n$, not necessarily a manifold; in particular, $M$ may be a single smooth submanifold or a union of such submanifolds. Let $T : \mathbb{R}^n \to \mathbb{R}^n$ be an isometry. If P : R…
Figure 11: Isometry action on a union of submanifolds. Left: a finite union $M = \bigcup_{i=1}^{L} M_{D_i}$, where $M_{D_1}$ is already learned as a canonical component. Middle: new samples (red points) do not lie on $M_{D_1}$ but are assumed to belong to an isometric image $g(M_{D_1})$. Learning $g^{-1}$ maps these samples back onto $M_{D_1}$, implicitly identifying the transformed manifold $g(M_{D_1})$. Right: by the invariance identity $P_{g(M)} = g \circ P_M \circ$ g −…
Figure 12: Geometric disentanglement of two submanifolds. Left: The tangent space at the intersection $T_x(M_{D_i} \cap M_{D_j})$ splits into submanifold-specific residual directions $T_x M_{D_i,R}$ and $T_x M_{D_j,R}$, which a trained network pushes to be as orthogonal as possible to ensure stable, unambiguous projection. Right: Linearized view where tangent spaces are approximated by subspace spans $\mathrm{span}(D_i)$, linking the geometric decomp…
Figure 13: Geometric illustration of residual-nullspace interference and selective annihilation.
Figure 14: (a) When the angle between the two subspaces is …
Figure 15: Isometric folding for out-of-union samples. A learnable transform …
Figure 16: Group-action view of the learned representation: projection onto the canonical manifold union $M$ and its orbit $G \cdot M$ generated by learnable transformations. … enumerating infinitely many submanifolds, this perspective shows that they arise from a canonical manifold (or canonical union) through symmetry transformations. Let $G$ be a group of learnable isometries acting on $\mathbb{R}^n$. The representation then naturally f…
Figure 17: Deep networks as hierarchical manifold generators: successive transformation families …
Figure 18: PoS module. The module consists of an input transformation ($T_{\mathrm{in}}$), a projection onto a structured subspace, and an output transformation ($T_{\mathrm{out}}$), together with a residual branch that captures components not explained by the current subspace. This structure serves as a fundamental building block for hierarchical composition in deep networks. We now introduce the Pursuit-of-Subspaces (PoS) module, illustrat…
Figure 19: Zero-shot ECG anomaly detection via PoS. (a) Personalized projection learning on …
Figure 20: Manifold projection as a prior for 3D microscopy reconstruction, composed of two stages.
Figure 21: Qualitative inspection of the domain adaptation technique in volumetric reconstruction.
Figure 22: Intersection-residual learning via coupled cross-projections. Left: The architecture details.
Figure 23: From residual learning to transformers under the PoS framework.
Figure 24: Dual-branch attention (DBA) for subspace selection. Left: One layer of hierarchical …
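
The flat-subspace special case described in the captions of Figures 9 to 11 can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the paper: the dictionary `D`, the test point `s`, the rotation angle `theta`, the translation `t`, and all helper names are hypothetical choices. It computes the orthogonal projection P(s) onto span(D), confirms that the residual s - P(s) is orthogonal to span(D), and compares a direct projection onto the isometric image g(M) with the composition of g, the projection onto M, and the inverse of g.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project(s, vectors):
    """Orthogonal projection of s onto span(vectors)."""
    p = [0.0] * len(s)
    for q in orthonormalize(vectors):
        c = dot(s, q)
        p = [pi + c * qi for pi, qi in zip(p, q)]
    return p


def rotate_about_x(v, theta):
    """Rotation of R^3 about the x-axis (a linear isometry)."""
    c, sn = math.cos(theta), math.sin(theta)
    return [v[0], c * v[1] - sn * v[2], sn * v[1] + c * v[2]]


# Assumed example data: a 2-dimensional subspace M = span(D) in R^3.
D = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
s = [2.0, -1.0, 3.0]

p = project(s, D)                      # P(s)
r = [si - pi for si, pi in zip(s, p)]  # residual s - P(s)

# Isometry g(x) = R x + t and its inverse g^{-1}(y) = R^T (y - t).
theta, t = 0.7, [0.5, -2.0, 1.0]


def g(x):
    return [a + b for a, b in zip(rotate_about_x(x, theta), t)]


def g_inv(y):
    return rotate_about_x([a - b for a, b in zip(y, t)], -theta)


# Direct projection onto the affine set g(M) = t + span(R D).
rotated_D = [rotate_about_x(d, theta) for d in D]
shifted = [a - b for a, b in zip(s, t)]
direct = [a + b for a, b in zip(project(shifted, rotated_D), t)]

# The same point obtained by applying g^{-1}, projecting onto M, then applying g.
via_identity = g(project(g_inv(s), D))

gap = max(abs(a - b) for a, b in zip(direct, via_identity))
print("inner products of residual with D:", [dot(r, d) for d in D])
print("max gap between the two projections:", gap)
```

Both printed quantities should be zero up to floating-point error. The check covers only a linear subspace with a rigid motion; it says nothing about curved manifolds, where the nearest-point projection need not be unique.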
Original abstract

While deep neural networks have achieved remarkable success across a wide range of domains, their underlying mechanisms remain poorly understood, and they are often regarded as black boxes. This gap between empirical performance and theoretical understanding poses a challenge analogous to the pre-axiomatic stage of classical geometry. In this work, we introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic framework that formulates neural network behavior through a set of geometric postulates. These axioms, together with their derived consequences, provide a unified perspective on representation, computation, and generalization in both shallow and deep architectures. We show that this framework yields geometric explanations for fundamental questions in deep learning, including representation structure, architectural mechanisms, and generalization behavior, offering a principled step toward a coherent theoretical foundation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Pursuit of Subspaces (PoS) hypothesis, an axiomatic framework that formulates neural network behavior via a set of geometric postulates. It claims that these axioms and their derived consequences furnish a unified perspective on representation, computation, and generalization for both shallow and deep architectures, while supplying geometric explanations for core questions including representation structure, architectural mechanisms, and generalization behavior.

Significance. If the postulates can be rigorously linked to SGD dynamics and shown to generate non-trivial, falsifiable consequences that match observed network behavior, the framework could supply a coherent theoretical foundation that moves beyond black-box descriptions. The explicit attempt at an axiomatic treatment is a constructive direction for the field.

major comments (2)
  1. [Abstract] Abstract: the assertion that the PoS axioms 'yield geometric explanations' for representation structure, architectural mechanisms, and generalization behavior is unsupported by any derivation steps, formal proofs, or empirical checks within the provided text; the central claim therefore rests on an unshown transition from postulates to concrete predictions.
  2. [§§2–3] §§2–3: the geometric postulates are introduced as primitive without a derivation from gradient-based optimization (chain rule) or the geometry of the empirical risk surface, so it remains unclear whether they explain hierarchical feature learning and generalization or merely re-describe them.
minor comments (1)
  1. [Introduction] The analogy drawn to the pre-axiomatic stage of classical geometry could be sharpened by identifying which specific geometric results (e.g., parallel postulate consequences) are meant to parallel the intended NN theorems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments identify important opportunities to strengthen the clarity and rigor of the axiomatic presentation. We respond to each major comment below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the PoS axioms 'yield geometric explanations' for representation structure, architectural mechanisms, and generalization behavior is unsupported by any derivation steps, formal proofs, or empirical checks within the provided text; the central claim therefore rests on an unshown transition from postulates to concrete predictions.

    Authors: We agree that the abstract is pitched at a high level of generality. Sections 4 and 5 of the manuscript derive several concrete geometric consequences from the PoS axioms, including the emergence of hierarchical subspace pursuit, a geometric account of skip connections, and a subspace-based generalization bound. To make the transition from postulates to predictions explicit in the abstract itself, we will revise it to reference these specific derived results and add forward pointers to the relevant theorems and corollaries.
    revision: yes

  2. Referee: [§§2–3] §§2–3: the geometric postulates are introduced as primitive without a derivation from gradient-based optimization (chain rule) or the geometry of the empirical risk surface, so it remains unclear whether they explain hierarchical feature learning and generalization or merely re-describe them.

    Authors: The PoS hypothesis is formulated as an axiomatic system in which the geometric postulates are taken as primitives, analogous to the role of incidence and congruence axioms in classical geometry. The manuscript motivates these postulates from observed neural-network phenomenology and then derives non-trivial consequences (e.g., progressive subspace alignment and implicit regularization effects) that go beyond re-description. Nevertheless, we acknowledge the value of an explicit link to SGD dynamics. We will add a new subsection in Section 3 that sketches how the postulates can arise as effective descriptions of gradient flow on the empirical risk surface, drawing on existing results on the geometry of overparameterized loss landscapes, while preserving the axiomatic character of the framework.
    revision: partial

Circularity Check

0 steps flagged

Axiomatic postulates presented as primitives; derivations self-contained

Full rationale

The manuscript introduces the PoS hypothesis explicitly as a set of geometric postulates chosen to formulate observed neural network behavior, then derives consequences for representation, computation, and generalization. No quoted equations or sections reduce any claimed prediction or explanation back to the postulates by construction (e.g., no fitted parameters renamed as predictions, no self-citation chain supplying the load-bearing uniqueness, and no ansatz smuggled via prior work). The framework is therefore self-contained as an axiomatic starting point rather than a tautological re-description of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's central claim rests on a new set of geometric postulates introduced without derivation from prior theory or data; these postulates function as the primary axioms. No free parameters or invented entities are explicitly named in the abstract, but the framework itself constitutes an ad-hoc axiomatic layer placed on top of existing network observations.

axioms (1)
  • [ad hoc to paper] Neural network behavior can be formulated through a set of geometric postulates concerning pursuit of subspaces.
    Stated in abstract as the foundational hypothesis; no prior justification or external derivation is supplied.

pith-pipeline@v0.9.0 · 5683 in / 1319 out tokens · 33539 ms · 2026-05-21T06:48:43.575166+00:00 · methodology



Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 8 internal anchors

  1. [1] Mete Ahishali, Mehmet Yamac, Serkan Kiranyaz, and Moncef Gabbouj. Operational support estimator networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):8442–8458, 2024

  2. [2] Randall Balestriero et al. A spline theory of deep learning. In International Conference on Machine Learning, pages 374–383. PMLR, 2018

  3. [3] Randall Balestriero and Yann LeCun. Learning by reconstruction produces uninformative features for perception. arXiv preprint arXiv:2402.11337, 2024

  4. [4] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

  5. [5] Thomas Blumensath and Mike E Davies. Sampling theorems for signals from the union of finite-dimensional linear subspaces. IEEE Transactions on Information Theory, 55(4):1872–1882, 2009

  6. [6] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv preprint arXiv:2104.13478, 2021

  7. [7] Emmanuel J Candes. The restricted isometry property and its implications for compressed sensing. Comptes rendus mathematique, 346(9-10):589–592, 2008

  8. [8] Emmanuel J Candès et al. Compressive sampling. In Proceedings of the International Congress of Mathematicians, volume 3, pages 1433–1452, 2006

  9. [9] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005

  10. [10] Diego Carrera, Beatrice Rossi, Daniele Zambon, Pasqualina Fragneto, and Giacomo Boracchi. Ecg monitoring in wearable devices by sparse models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer, 2016

  11. [11] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016

  12. [12] SueYeon Chung and Larry F Abbott. Neural population geometry: An approach for understanding biological and artificial neural networks. Current opinion in neurobiology, 70:137–144, 2021

  13. [13] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International conference on machine learning, pages 1310–1320. PMLR, 2019

  14. [14] Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral cnn. In International conference on Machine learning, pages 1321–1330. PMLR, 2019

  15. [15] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999. PMLR, 2016

  16. [16] Uri Cohen, SueYeon Chung, Daniel D Lee, and Haim Sompolinsky. Separability and geometry of object manifolds in deep neural networks. Nature communications, 11(1):746, 2020

  17. [17] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020

  18. [18] David L Donoho et al. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006

  19. [19] Ewa Dudek and Konstanty Holly. Nonlinear orthogonal projection. In Annales Polonici Mathematici, volume 59, pages 1–31. Polska Akademia Nauk. Instytut Matematyczny PAN, 1994

  20. [20] Bjørn Ian Dundas. A short course in differential topology. Cambridge University Press, 2018

  21. [21] Association for the Advancement of Medical Instrumentation. Recommended practice for testing and reporting performance results of ventricular arrhythmia detection algorithms. Arlington, VA, 1987

  22. [22] Karl Friston. A theory of cortical responses. Philosophical transactions of the Royal Society B: Biological sciences, 360(1456):815–836, 2005

  23. [23] Karl Friston. The free-energy principle: a unified brain theory? Nature reviews neuroscience, 11(2):127–138, 2010

  24. [24] M. Gabbouj, S. Kiranyaz, J. Malik, M. U. Zahid, T. Ince, M. E. H. Chowdhury, A. Khandakar, and A. Tahir. Robust Peak Detection for Holter ECGs by Self-Organized Operational Neural Networks. IEEE Trans Neural Netw Learn Syst, PP, Mar 2022

  25. [25] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000

  26. [26] Alexis Marie Frederic Goujon. Towards trustworthy deep learning for image reconstruction. Technical report, EPFL, 2024

  27. [27] Victor Guillemin and Alan Pollack. Differential topology, volume 370. American Mathematical Society, 2025

  28. [28] Changliang Guo, Wenhao Liu, Xuanwen Hua, Haoyu Li, and Shu Jia. Fourier light-field microscopy. Optics express, 27(18):25573–25594, 2019

  29. [29] Michael Hauser and Asok Ray. Principles of riemannian geometry in neural networks. Advances in neural information processing systems, 30, 2017

  30. [31] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  31. [32] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933

  32. [33] Ahmed Imtiaz Humayun, Randall Balestriero, Guha Balakrishnan, and Richard G Baraniuk. Splinecam: Exact visualization and characterization of deep network geometry and decision boundaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3789–3798, 2023

  33. [34] Arta A Jamshidi, Michael J Kirby, and Dave S Broomhead. Geometric manifold learning. IEEE Signal Processing Magazine, 28(2):69–76, 2011

  34. [35] William B Johnson and Joram Lindenstrauss. Extensions of Lipschitz Mappings Into a Hilbert Space. Contemporary mathematics, 26(189-206):1, 1984

  35. [36] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

  36. [37] Serkan Kiranyaz, Turker Ince, and Moncef Gabbouj. Real-time patient-specific ecg classification by 1-d convolutional neural networks. IEEE Transactions on Biomedical Engineering, 63(3):664–675, 2016

  37. [38] Serkan Kiranyaz, Turker Ince, and Moncef Gabbouj. Personalized monitoring and advance warning system for cardiac arrhythmias. Scientific Reports, 7(1):9270, 2017

  38. [39] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International conference on machine learning, pages 2747–2755. PMLR, 2018

  39. [40] Oren Kraus, Kian Kenyon-Dean, Saber Saberian, Maryam Fallah, Peter McLean, Jess Leung, Vasudev Sharma, Ayla Khan, Jia Balakrishnan, Safiye Celik, et al. Masked autoencoders for microscopy are scalable learners of cellular biology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11757–11768, 2024

  40. [41] Nikolaus Kriegeskorte and Xue-Xin Wei. Neural tuning and representational geometry. Nature Reviews Neuroscience, 22(11):703–718, 2021

  41. [42] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  42. [43] John M Lee. Introduction to topological manifolds. Springer, 2000

  43. [44] Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pages 3794–3803. PMLR, 2019

  44. [45] Cuiwei Li, Chongxun Zheng, and Changfeng Tai. Detection of ecg characteristic points using wavelet transforms. IEEE Transactions on biomedical Engineering, 42(1):21–28, 1995

  45. [46] Shuai Li, Kui Jia, Yuxin Wen, Tongliang Liu, and Dacheng Tao. Orthogonal deep neural networks. IEEE transactions on pattern analysis and machine intelligence, 43(4):1352–1368, 2019

  46. [47] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the european conference on computer vision (ECCV), pages 369–385, 2018

  47. [48] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016

  48. [49] Yue M Lu and Minh N Do. A theory for sampling signals from a union of subspaces. IEEE transactions on signal processing, 56(6):2334–2345, 2008

  49. [50] Mario Merone, Paolo Soda, Mario Sansone, and Carlo Sansone. Ecg databases for biometric systems: A systematic review. Expert Systems with Applications, 67:189–202, 2017

  50. [51] Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. Advances in neural information processing systems, 27, 2014

  51. [52]

    The impact of the MIT-BIH arrhythmia database

    George B Moody and Roger G Mark. The impact of the MIT-BIH arrhythmia database. IEEE Engineering in Medicine and Biology Magazine, 20(3):45–50, 2001

  52. [53]

    Geometry, topology and physics

    Mikio Nakahara. Geometry, topology and physics. CRC Press, 2018

  53. [54]

    Sample complexity of testing the manifold hypothesis

    Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. Advances in neural information processing systems, 23, 2010

  54. [55]

    Adding Gradient Noise Improves Learning for Very Deep Networks

    Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015

  55. [56]

    Finding the homology of submanifolds with high confidence from random samples

    Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1):419–441, 2008

  56. [57]

    Mortal computation: A foundation for biomimetic intelligence

    Alexander Ororbia and Karl Friston. Mortal computation: A foundation for biomimetic intelligence. arXiv preprint arXiv:2311.09589, 2023

  57. [58]

    A real-time QRS detection algorithm

    Jiapu Pan and Willis J Tompkins. A real-time QRS detection algorithm. IEEE transactions on biomedical engineering, (3):230–236, 1985

  58. [59]

    On the number of response regions of deep feed forward networks with piece-wise linear activations

    Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013

  59. [60]

    A neural manifold view of the brain

    Matthew G Perich, Devika Narain, and Juan A Gallego. A neural manifold view of the brain. Nature Neuroscience, 28(8):1582–1597, 2025

  60. [61]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

  61. [62]

    Understanding Deep Learning

    Simon J.D. Prince. Understanding Deep Learning. The MIT Press, 2023

  62. [63]

    On the expressive power of deep neural networks

    Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning, pages 2847–2854. PMLR, 2017

  63. [64]

    Efficient learning of sparse representations with an energy-based model

    Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. Advances in neural information processing systems, 19, 2006

  64. [65]

    The manifold tangent classifier

    Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. Advances in neural information processing systems, 24, 2011

  65. [66]

    The unreasonable effectiveness of deep learning in artificial intelligence

    Terrence J Sejnowski. The unreasonable effectiveness of deep learning in artificial intelligence. Proceedings of the National Academy of Sciences, 117(48):30033–30038, 2020

  66. [67]

    Bounding and counting linear regions of deep neural networks

    Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pages 4558–4566. PMLR, 2018

  67. [68]

    Dropout: a simple way to prevent neural networks from overfitting

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014

  68. [69]

    An Introduction to Manifolds

    L.W. Tu. An Introduction to Manifolds. Universitext. Springer New York, 2010

  69. [70]

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

    Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010

  70. [71]

    Cvt: Introducing convolutions to vision transformers

    Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22–31, 2021

  71. [72]

    Masked frequency modeling for self-supervised visual pre-training

    Jiahao Xie, Wei Li, Xiaohang Zhan, Ziwei Liu, Yew-Soon Ong, and Chen Change Loy. Masked frequency modeling for self-supervised visual pre-training. In The Eleventh International Conference on Learning Representations

  72. [73]

    Mehmet Yamaç, Mete Ahishali, Serkan Kiranyaz, and Moncef Gabbouj. Convolutional sparse support estimator network (CSEN): From energy-efficient support estimation to learning-aided compressive sensing. IEEE Transactions on Neural Networks and Learning Systems, 34(1):290–304, 2021

  73. [74]

    Mehmet Yamaç, Mert Duman, İlke Adalıoğlu, Serkan Kiranyaz, and Moncef Gabbouj. A personalized zero-shot ECG arrhythmia monitoring system: From sparse representation based domain adaption to energy efficient abnormal beat detection for practical ECG surveillance. arXiv preprint arXiv:2207.07089, 2022

  74. [75]

    Video-rate 3D imaging of living cells using Fourier view-channel-depth light field microscopy

    Chengqiang Yi, Lanxin Zhu, Jiahao Sun, Zhaofei Wang, Meng Zhang, Fenghe Zhong, Luxin Yan, Jiang Tang, Liang Huang, Yu-Hui Zhang, et al. Video-rate 3D imaging of living cells using Fourier view-channel-depth light field microscopy. Communications biology, 6(1):1259, 2023

  75. [76]

    Cutmix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019

  76. [77]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016

  77. [78]

    mixup: Beyond Empirical Risk Minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017

A Notation

In this work, we consider the $\ell_p$-norm of a vector $x \in \mathbb{R}^n$, defined by $\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$ with $p \geq 1$. The $\ell_0$ “norm” is given by $\|x\|_0 = \lim_{p \to 0} \sum_{i=1}^{n} |x_i|^p$, which counts the number of nonzero e...