arxiv: 2605.04325 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.DM· math.CO

Recognition: unknown

On the Architectural Complexity of Neural Networks

Nicholas J. Cooper , Fran\c{c}ois G. Meyer , Michael L. Roberts , Carlos Zapata-Carratal\'a , Lijun Chen , Danna Gurari

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:26 UTC · model grok-4.3

classification 💻 cs.LG cs.DMmath.CO

keywords deep neural networksarchitectural complexitytensor operationsDNN evolutionneural architecture constructioncomplexity analysisautomatic architecture generation

0 comments

The pith

A framework that models tensor operations in neural networks shows groundbreaking architectures increase specific types of complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a unified theoretical framework for deep neural networks that explicitly tracks the structure of tensor operations rather than abstracting them away. This model makes it possible to measure different kinds of architectural complexity and to trace how those measures have changed across DNNs proposed over the last forty years. The analysis finds a consistent pattern: each major advance aligns with an increase in one or more of these complexity types. The same framework also flags large families of still-unexplored higher-complexity designs and supplies a public dataset of more than three thousand concrete examples.

Core claim

Our study of DNNs introduced over the past 40 years reveals a connection between groundbreaking architectures and increases in different types of architectural complexity. The framework enables two novel objectives: analysis of the evolution of architectural complexity over deep learning history, and automatic construction of novel architectures based on new types of tensor operations. Several large classes of higher complexity architectures have not yet been explored.

What carries the argument

The unified theoretical framework that explicitly models the structure of tensor operations to quantify and generate increases in architectural complexity.

If this is right

Groundbreaking architectures increase distinct complexity types rather than overall complexity.
Automatic construction of new architectures becomes feasible by introducing previously unused tensor operations.
Large families of higher-complexity architectures remain unexplored and can now be enumerated systematically.
A dataset of over 3,000 higher-complexity architectures is now available for further research and construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Targeting particular complexity increases could become a design heuristic for creating more capable networks.
The same tensor-operation lens might be applied to measure complexity growth in non-neural learning systems.
Empirical tests could check whether the identified complexity jumps reliably predict gains in accuracy or efficiency.

Load-bearing premise

The chosen modeling of tensor operations captures the structural features that actually drive architectural breakthroughs rather than merely correlating with them.

What would settle it

A new neural-network architecture that delivers a clear breakthrough in performance or capability while showing no increase in any of the framework's defined complexity types would falsify the claimed connection.

Figures

Figures reproduced from arXiv: 2605.04325 by Carlos Zapata-Carratal\'a, Danna Gurari, Fran\c{c}ois G. Meyer, Lijun Chen, Michael L. Roberts, Nicholas J. Cooper.

**Figure 1.** Figure 1: Proposed hierarchical framework. Arrows de view at source ↗

**Figure 2.** Figure 2: HCCs vs. CCs. Right complex is not an HCC because the 2-cell is composed of 1-cells and 0-cells. Shading of 0-cells indicates X 0 |s 1 1 and X 0 |s 1 2 . Hierarchical Combinatorial Complexes. Our framework is built on hierarchical combinatorial complexes (HCCs) — a type of set hierarchy. Definition 3.1. A rank-r hierarchical combinatorial complex X is a set S along with a ranked family of collections of… view at source ↗

**Figure 3.** Figure 3: A 1-slice space of a matrix. Various partitions of tensors such as the pixel space of an image are slice spaces — higher order functions4 obtained from binary partitions of a tensor’s modes. Definition 3.3. A k-slice space of a tensor T is a tensor-valued tensor: Tλ : [[M1]] × ... × [[MO−k]] → Tσ : [[MO−k+1]] × ... × [[MO]] → V . Slice spaces are function-valued functions which accept (O − k) input arg… view at source ↗

**Figure 4.** Figure 4: Visualization of the generalized tensor associated to a 2×2×2 multidimensional array. Colors indicate the partition of the 1-cells. Generalized tensors are effectively partitions of collections of ordered sets. Their 2-cells correspond to the modes of a multidimensional array and their 1-cells to 1-slices view at source ↗

**Figure 5.** Figure 5: Example of a mode map. Left side: a jagged tensor of order 4 (top) maps into a matrix (bottom). Right side: the mode map components. Shaded arrows indicate the modes. Mode maps are functions between tensors that preserve their slice space structures. As tensors can have many slice spaces, there is not a unique choice of structure to preserve, so mode maps must be defined with respect to collections of mod… view at source ↗

**Figure 6.** Figure 6: TOM representation of the cone product. Rows correspond to input tensors; columns correspond to modes of the output tensor. (•/◦) indicates which modes are/are not part of which inputs. (↓) denotes a contracted mode. Condition 2.i of Definition 4.5 ensures that tensor operations admit useful binary-valued matrix representations that we call tensor operation matrices (TOMs). These representations are effec… view at source ↗

**Figure 7.** Figure 7: Illustration of Theorem 4.6. Top: input tensors to the TOM for matrix multiplication; modes are color-coded. Bottom: operands are broadcasted (dotted arrows) according to the coupling structure and then merged to create a 2-regular hyper-tensor (bottom right). The result is given by applying one base operation along the tuples of the hyper-tensor and then applying the second base operation along the 1-slic… view at source ↗

**Figure 8.** Figure 8: TEM representation of the PolyNets core block. Rows correspond to operations; columns correspond to tensors. (•/◦) indicates which tensors are/are not part of which operations. Analogously to how tensor operations can be represented with TOMs, the top level relational structure in a neural network (orange triangle of view at source ↗

**Figure 9.** Figure 9: Architecture search space. Overview. We collect a dataset of 3, 028 architectures sampled from spaces of the type illustrated in view at source ↗

**Figure 10.** Figure 10: (•,⋆) sampled models. (−−) stage 1 baselines. (■) full ResNet/Swin models at different width-scales. After each tensor operation, some randomly selected non-linear activations (from a pool of: Leaky ReLU, ReLU6, Layer Norm, and Softmax) are added with 50% probability. Shapes for all intermediate tensors are randomly sampled. Benchmarks. Following standard NAS practice [46, 11], we sample architectures for… view at source ↗

**Figure 11.** Figure 11: A new tensor operation from the (⋆) architecture. X1, X2 are input tensors; W1, W2 are learned weight tensors. Cα = 4, CO = 7, CA = 3. Specifically, the red star sample (⋆) achieves 65.52% ± 0.22 with less than 200,000 parameters (152,000 are from the ResNet stage 1), an order of magnitude fewer than MobileNetV2’s 2.5 million. This is notable because MobileNetV2 is still widely used today as a lightweig… view at source ↗

**Figure 12.** Figure 12: Structure diagrams for the default formulation of PWO-HGs from Definition A.4 (left) view at source ↗

**Figure 14.** Figure 14: Structure diagrams for the functional definition of multidimensional arrays (left) and the view at source ↗

**Figure 13.** Figure 13: Copy of Figure 4 view at source ↗

**Figure 15.** Figure 15: Visualization of an example “exterior” corner. (→) edges of T added by E1. (→) edges of T added by E2 Next, we show that WLOG, a ′ is an endpoint of ej and b is not a e ′ j endpoint of the same type, that is, a ′ and b cannot be simultaneously both maximal or both minimal with respect to <ej and <e ′ j . To see this, observe that the equal length property and the maximality of H show that if a ′ and b we… view at source ↗

**Figure 16.** Figure 16: Index map. Dotted boxes are contracted modes. view at source ↗

**Figure 17.** Figure 17: Visualization of the matrix multiplication of two jagged matrices. Modes are color-coded. view at source ↗

**Figure 18.** Figure 18: Tensor operation matrix (TOM) representation of the fish product. Rows correspond to the input tensors; columns to the modes of the output. Filled/open circles indicate which modes are/are not part of which inputs. Red arrows denote contractions. Observe that despite being arity 3, this operation has coupling arity 2. The Fish Product. The fish product is an example of ternary operation which has only b… view at source ↗

**Figure 11.** Figure 11: For convenience, the TOM for this operation is copied in view at source ↗

**Figure 20.** Figure 20: TOM with CO = 6, Cα = 4, CA = 4. Another high-complexity tensor operation from a top-performing sampled architecture is shown in view at source ↗

**Figure 21.** Figure 21: TOM with CO = 8, Cα = 4, CA = 3. A third high-complexity tensor operation from a top-performing sampled architecture is shown in view at source ↗

**Figure 22.** Figure 22: TEM for Copresheaf message aggregation on a vertex neighborhood of cardinality 3. Naturally, any of these copresheaf-based networks can be expressed in our framework, as they are still combinations of tensor operations and mode maps (most mode maps being trivial). However, there are deeper connections between CNMs and TEMs. For the sake of discussion, let us focus our attention to the message aggregati… view at source ↗

**Figure 23.** Figure 23: TOM for 2-dimensional convolution. Convolutional Networks. We now consider purely convolutional networks such as VGG [40]. We focus our attention to a single convolutional “layer", which consists of a binary operation and an unfolding mode map. The mode map is responsible for extracting patches from the input image of shape (c, h, w), a process which produces an order 5 tensor of shape (c, h′ , w′ , ph,… view at source ↗

**Figure 24.** Figure 24: TEM for a residual block. As demonstrated in view at source ↗

**Figure 25.** Figure 25: TEM for self attention. Self-Attention. Next, we consider transformer models such as [44, 12]. We focus our attention to a single self-attention “operation". We start with the system of operations: ZQ = X × WQ ZK = X × WK ZV = X × WV ZA = ZQ × Z T K Y = ZA × ZV view at source ↗

**Figure 26.** Figure 26: TOMs for operations 2 and 3 of self attention. The above would suggest that Cop = 5. However, as there are no activations or non-trivial mode maps between the computation of Z3 and Y , we can combine those operations into a single operation of arity 3 and coupling-arity 2. Similarly, we can simplify the computation of the attention map A. This updated system of equations is below: ZQ = X × WQ A = ZQ × (X… view at source ↗

**Figure 27.** Figure 27: Copy of Figure 1 (top) from [ view at source ↗

**Figure 28.** Figure 28: TEM for Q -net V1. Z1 = X ⊛ W1 Z2 = X ⊛ W2 Z3 = X ⊛ W3 Z4 = Z1 ⊙ Z3 Z5 = Z4 ⊛ W4 Y = Z2 ⊙ Z5 view at source ↗

**Figure 29.** Figure 29: TOM for operations 3 and 4 of Q -net V1. Similarly to self-attention, this system of operations can be simplified by combining operations. Specifically, we can combine the computation of Z3 and Z4 as no non-trivial mode maps nor non-linear activations occur in between. Similarly, we can combine the computation of Z5 and Y . This produces the following system of operations: Z1 = X ⊛ W1 Z2 = X ⊛ W2 Z3 = Z1 … view at source ↗

**Figure 30.** Figure 30: TEM for MO-Nets. MO-Nets. Next, we consider the multilinear operator networks (MO-Nets) of [6]. We focus our attention to the core building block defined by equation (1) of [6]. We use the tensor shapes provided in the original paper. This system of operations is given below: Z1 = X × WA Z2 = X × WB × WD Z3 = Z1 ⊙ Z2 Z4 = Z3 ⊕ Z1 Y = Z4 × WC 47 view at source ↗

**Figure 31.** Figure 31: TOM for operation 1 of MONets. Where X has shape (N, D), WA has shape (D, M), WB has shape (D, L), WD has shape (L, M), and C has shape (M, O). Similarly to the analysis of Poly-Nets, we can combine the computation of Z2 and Z3. Nothing else can be simplified because the TOM for Z4 uses different base operations. The simplified system of operations is given below: Z1 = X × WA Z2 = Z1 ⊙ (X × WB × WD) Z3 =… view at source ↗

**Figure 32.** Figure 32: Selected TOM from ViM. First, the computation on line 8 (Algorithm 1, [48]) can be combined with that of the second term on line 14 to produce a ternary operation. The TOM for this is shown in view at source ↗

**Figure 33.** Figure 33: Copy of Figure 2(d) from [34] (reproduced with permission). Deep Tensor Tree Networks. Next, we consider the tensor tree networks (DTTNs) from [34]. We focus our attention to the core building block of the model, the so-called asymmetric interaction module (AIM) described in view at source ↗

**Figure 34.** Figure 34: TOM for the (2-d) depthwise convolution operation. We integrate the Hadamard product into the right branch as this produces the lowest complexity encoding. The resulting simplest system of operations is given below: Z1 = (X ⊛D WC1) ×Ch WL1 Z2 = X ×Ch WL2 Z3 = (Z2 ⊛D WC2) ⊙ Z1 Z4 = Z3 ×Ch WL3 Y = X ⊕ Z4 view at source ↗

**Figure 35.** Figure 35: TEM for the AIM of DTTNs. Where ⊛D is depthwise convolution and ×Ch is matrix multiplication along the channel mode. The TEM is shown in view at source ↗

**Figure 36.** Figure 36: TOMs 1 (left) and 3 (right) for the AIM of DTTNs. 49 view at source ↗

**Figure 37.** Figure 37: Diagnostic data for ResNet34 trained for view at source ↗

**Figure 38.** Figure 38: Diagnostic data for ResNet34 trained for view at source ↗

**Figure 39.** Figure 39: Diagnostic data for ResNet34 trained for view at source ↗

**Figure 40.** Figure 40: Diagnostic data for Swin_T trained for 50 epochs on CIFAR-10 view at source ↗

**Figure 41.** Figure 41: Diagnostic data for Swin_T trained for 50 epochs on CIFAR-100 view at source ↗

**Figure 42.** Figure 42: Diagnostic data for Swin_T trained for 50 epochs on Tiny Imagenet. 55 view at source ↗

**Figure 43.** Figure 43: Diagnostic data for MobileNetV2 trained for view at source ↗

**Figure 44.** Figure 44: Distribution of sampled architectures (•) vs. number of parameters of the sampled block. (—) Stage 1 baseline accuracy. Dataset CIFAR 10 CIFAR 100 Tiny ImageNet Stage 1 ResNet34 SWIN_T ResNet34 SWIN_T ResNet34 SWIN_T Total 526 507 502 513 486 494 view at source ↗

**Figure 45.** Figure 45: Distribution of sampled architectures (•) vs. operation complexity Cop of the sampled block. (−−) Mean sample accuracy. (−−) Median sample accuracy. (—) Stage 1 baseline accuracy. Dataset CIFAR 10 CIFAR 100 Tiny ImageNet Stage 1 ResNet34 SWIN_T ResNet34 SWIN_T ResNet34 SWIN_T Total Cop = 2 162 153 151 145 129 141 881 Cop = 3 131 122 127 124 127 128 759 Cop = 4 132 125 124 125 129 123 758 Cop = 5 101 107 1… view at source ↗

**Figure 46.** Figure 46: Distribution of sampled architectures (•) vs. tensor complexity CT of the sampled block. (−−) Mean sample accuracy. (−−) Median sample accuracy. (—) Stage 1 baseline accuracy. Dataset CIFAR 10 CIFAR 100 Tiny ImageNet Stage 1 ResNet34 SWIN_T ResNet34 SWIN_T ResNet34 SWIN_T Total CT = 4 1 ∅ 3 ∅ 6 2 12 CT = 5 38 45 35 59 21 40 238 CT = 6 62 63 60 45 42 53 325 CT = 7 77 74 75 70 82 62 440 CT = 8 62 54 64 57 6… view at source ↗

**Figure 47.** Figure 47: Distribution of sampled architectures (•) vs. arity complexity Cα of the sampled block. (−−) Mean sample accuracy. (−−) Median sample accuracy. (—) Stage 1 baseline accuracy. Dataset CIFAR 10 CIFAR 100 Tiny ImageNet Stage 1 ResNet34 SWIN_T ResNet34 SWIN_T ResNet34 SWIN_T Total Cα = 2 6 7 7 10 5 6 41 Cα = 3 116 127 105 111 90 108 657 Cα = 4 404 373 390 392 391 380 2330 view at source ↗

**Figure 48.** Figure 48: Distribution of sampled architectures (•) vs. order complexity CO of the sampled block. (−−) Mean sample accuracy. (−−) Median sample accuracy. (—) Stage 1 baseline accuracy. Dataset CIFAR 10 CIFAR 100 Tiny ImageNet Stage 1 ResNet34 SWIN_T ResNet34 SWIN_T ResNet34 SWIN_T Total CO = 3 ∅ 7 ∅ 3 ∅ 7 17 CO = 4 ∅ 58 ∅ 80 ∅ 213 351 CO = 5 156 193 140 183 131 163 966 CO = 6 144 159 160 162 117 89 831 CO = 7 151 7… view at source ↗

**Figure 49.** Figure 49: Distribution of sampled architectures (•) vs. coupling arity complexity CA of the sampled block. (−−) Mean sample accuracy. (−−) Median sample accuracy. (—) Stage 1 baseline accuracy. Dataset CIFAR 10 CIFAR 100 Tiny ImageNet Stage 1 ResNet34 SWIN_T ResNet34 SWIN_T ResNet34 SWIN_T Total CA = 2 79 81 74 84 71 65 454 CA = 3 357 316 303 319 300 293 1888 CA = 4 90 110 125 110 115 136 686 view at source ↗

**Figure 50.** Figure 50: Distribution of sampled architectures (•) vs. number of non-linear activations CNL of the sampled block. (−−) Mean sample accuracy. (−−) Median sample accuracy. (—) Stage 1 baseline accuracy. Dataset CIFAR 10 CIFAR 100 Tiny ImageNet Stage 1 ResNet34 SWIN_T ResNet34 SWIN_T ResNet34 SWIN_T Total CNL = 0 53 70 51 81 39 95 389 CNL = 1 73 81 87 98 85 79 503 CNL = 2 135 146 121 120 123 122 767 CNL = 3 89 92 94 … view at source ↗

**Figure 51.** Figure 51: Tensor equation matrix for the sampled block of the view at source ↗

**Figure 52.** Figure 52: Tensor operation matrices for operations view at source ↗

**Figure 53.** Figure 53: Tensor operation matrix for operation 3 used in the ⋆ architecture. CO = 9, Cα = 4, CA = 2. 66 view at source ↗

**Figure 54.** Figure 54: Tensor operation matrix for operation 4 used in the ⋆ architecture. CO = 7, Cα = 4, CA = 3. Same TOM as in view at source ↗

read the original abstract

We introduce a unified theoretical framework for the rigorous analysis and systematic construction of deep neural networks (DNNs). This framework addresses a gap in existing theory by explicitly modeling the structure of tensor operations -- lower level information that is often abstracted. Our framework enables two novel objectives: (1) analysis of the evolution of architectural complexity over deep learning history, and (2) automatic construction of novel architectures based on new types of tensor operations. Our study of DNNs introduced over the past 40 years reveals a connection between groundbreaking architectures and increases in different types of architectural complexity. Moreover, we identify several large classes of higher complexity architectures that have not yet been explored. We then collect a dataset of 3,000+ higher complexity architectures, which we publicly release at: https://github.com/combinatoriallabs/ArchitecturalComplexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The tensor-operation framework and released dataset are concrete additions worth checking, but the historical link between complexity jumps and breakthroughs looks vulnerable to circularity.

read the letter

The paper defines architectural complexity through explicit tensor operations instead of the usual high-level blocks. This gives them a way to both dissect how networks have changed over decades and to generate new ones by combining operations that haven't been tried together. They back the generation side with a public release of more than 3000 higher-complexity architectures that sit outside the ones already explored in the literature.

Referee Report

1 major / 1 minor

Summary. The paper introduces a unified theoretical framework for deep neural networks that explicitly models the structure of tensor operations. This framework is used to analyze the evolution of architectural complexity across DNNs introduced over the past 40 years, revealing a claimed connection between groundbreaking architectures and increases in various complexity types. It further identifies unexplored classes of higher-complexity architectures and releases a public dataset of over 3,000 such architectures.

Significance. If the tensor-operation modeling supplies an independent, a priori complexity metric that can be applied uniformly without hindsight bias, the work could provide a systematic lens on why certain architectures succeeded historically and a basis for generating new ones. The public dataset release is a concrete positive contribution that enables community follow-up.

major comments (1)

The central historical analysis risks circularity: if complexity types are identified by inspecting the tensor operations that distinguish the 'groundbreaking' architectures (e.g., skip connections or attention), then the reported correlation is at least partly definitional rather than predictive. The manuscript must demonstrate that the complexity measures are pre-specified and applied blindly to a fixed corpus, independent of the historical narrative used to select examples.

minor comments (1)

The abstract and early sections provide only high-level descriptions of the framework and findings; concrete definitions of the complexity types, the precise tensor-operation primitives, and an example calculation on a known architecture (e.g., ResNet or Transformer) would improve accessibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address the single major comment below, and we will revise the paper to improve the clarity of our methodological description.

read point-by-point responses

Referee: The central historical analysis risks circularity: if complexity types are identified by inspecting the tensor operations that distinguish the 'groundbreaking' architectures (e.g., skip connections or attention), then the reported correlation is at least partly definitional rather than predictive. The manuscript must demonstrate that the complexity measures are pre-specified and applied blindly to a fixed corpus, independent of the historical narrative used to select examples.

Authors: We appreciate the referee highlighting this methodological concern. The complexity types in our framework (arity, transformation, and composition complexity) are formally defined in Sections 2 and 3 as general mathematical properties of arbitrary tensor operation sequences; these definitions are stated prior to any reference to specific architectures or historical events and make no use of examples such as skip connections or attention. The historical corpus consists of a fixed list of 42 landmark DNNs spanning 1980–2023, compiled from established surveys of deep-learning history and selected solely on the basis of their documented impact in the literature, without reference to our metrics. The pre-defined measures were then applied uniformly to this corpus. We acknowledge that the current manuscript does not contain an explicit statement of this ordering and pre-specification. In the revised version we will therefore insert a new subsection (tentatively 4.1) that (i) restates the a-priori definitions, (ii) reproduces the exact corpus list together with its selection criteria, and (iii) describes the blind computation procedure. This addition will make the independence of the analysis fully transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework and historical analysis presented as independent

full rationale

The provided abstract and excerpts introduce a tensor-operation modeling framework as a novel, independent tool for both analyzing 40 years of DNN history and constructing new architectures. The claimed connection between groundbreaking architectures and complexity increases is presented as an empirical revelation from applying the framework, not as a definitional or fitted outcome. No equations, self-citations, or derivations are quoted that reduce the complexity measure to the same operations used to label architectures as groundbreaking. The release of a separate 3000+ architecture dataset further indicates an external corpus. Per hard rules, absent specific quotable reductions (e.g., Eq. X defined in terms of Y or fitted parameter renamed as prediction), the derivation chain is self-contained with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework is described at a high level without detailing underlying mathematical assumptions or new postulated objects.

pith-pipeline@v0.9.0 · 5461 in / 1143 out tokens · 28415 ms · 2026-05-08T17:26:10.459376+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

76 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murra...

2015
[2]

Eskue, Roger M

Sasan Salmani Pour Avval, Nathan D. Eskue, Roger M. Groves, and Vahid Yaghoubi. Systematic review on neural architecture search.Artificial Intelligence Review, 2025

2025
[3]

Pld: A choice-theoretic list-wise knowledge distillation

Ejafa Bassam, Dawei Zhu, and Kaigui Bian. Pld: A choice-theoretic list-wise knowledge distillation. InNeurIPS, 2025

2025
[4]

Alain Bretto.Hypergraph Theory, an Introduction.Springer, 2013. 10

2013
[5]

Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi´c

Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi´c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. 2021

2021
[6]

Chrysos, Markos Georgopoulos, and V olkan Cevher

Yixin Cheng, Grigorios G. Chrysos, Markos Georgopoulos, and V olkan Cevher. Multilinear operator networks. InICLR, 2024

2024
[7]

Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Yannis Panagakis, Jiankang Deng, and Stefanos Zafeiriou

Grigorios G. Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Yannis Panagakis, Jiankang Deng, and Stefanos Zafeiriou. P-nets: Deep polynomial neural networks. InCVPR, 2020

2020
[8]

Chrysos, Markos Georgopoulos, Jiankang Deng, Jean Kossaifi, Yannis Panagakis, and Anima Anandkumar

Grigorios G. Chrysos, Markos Georgopoulos, Jiankang Deng, Jean Kossaifi, Yannis Panagakis, and Anima Anandkumar. Augmenting deep classifiers with polynomial neural networks. In ECCV, 2022

2022
[9]

Chrysos, Yongtao Wu, Razvan Pascanu, Philip H.S

Grigorios G. Chrysos, Yongtao Wu, Razvan Pascanu, Philip H.S. Torr, and V olkan Cevher. Hadamard product in deep learning: Introduction, advances and challenges. InIEEE PAMI, 2025

2025
[10]

Deformable convolutional networks

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. InICCV, 2017

2017
[11]

Nas-bench-201: Extending the scope of reproducible neural architecture search

Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. InICLR, 2020

2020
[12]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, , and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021

2021
[13]

On expressivity and trainability of quadratic networks

Feng-Lei Fan, Mengzhou Li, Fei Wang, Rongjie Lai, and Ge Wang. On expressivity and trainability of quadratic networks. InIEEE NNLS, 2023

2023
[14]

Araujo, and Petar Velickovi´c

Bruno Gavranovi´c, Paul Lessard, Andrew Dudzik, Tamara von Glehn, Joao G.M. Araujo, and Petar Velickovi´c. Position: Categorical deep learning is an algebraic theory of all architectures. InICML, 2024

2024
[15]

Mamba: Linear-time sequence modeling with selective state spaces,

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces,
[16]

URLhttps://arxiv.org/abs/2312.00752

work page internal anchor Pith review arXiv
[17]

Dey, Soham Mukherjee, Shreyas N

Mustafa Hajij, Ghada Zamzmi, Theodore Papamarkou, Nina Miolane, Aldo Guzmán-Sáenz, Karthikeyan Natesan Ramamurthy, Tolga Birdal, Tamal K. Dey, Soham Mukherjee, Shreyas N. Samaga, Neal Livesay, Robin Walters, Paul Rosen, and Michael T. Schaub. Topological deep learning: Going beyond graph data. 2022

2022
[18]

Davenport, Sheik Dawood, Balaji Cherukuri, Joseph G

Mustafa Hajij, Lennart Bastian, Sarah Osentoski, Hardik Kabaria, John L. Davenport, Sheik Dawood, Balaji Cherukuri, Joseph G. Kocheemoolayil, Nastaran Shahmansouri, Adrian Lew, Theodore Papamarkou, and Tolga Birdal. Copresheaf topological neural networks: A general- ized deep learning framework. InNeurIPS, 2025

2025
[19]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, , and Jian Sun. Deep residual learning for image recognition. InCVPR, 2016

2016
[20]

Category-theoretical and topos- theoretical frameworks in machine learning: A survey.Axioms, 2025

Yiyang Jia, Guohong Peng, Zheng Yang, and Tianhao Chen. Category-theoretical and topos- theoretical frameworks in machine learning: A survey.Axioms, 2025

2025
[21]

On the expressive power of deep polynomial neural networks

Joe Kileel, Matthew Trager, and Joan Bruna. On the expressive power of deep polynomial neural networks. InNeurIPS, 2019

2019
[22]

Kingma and J

D. Kingma and J. Ba. Adam: A method for stochastic optimization. InICLR, 2015

2015
[23]

Krizhevsky, V

A. Krizhevsky, V . Nair, and G. Hinton. Cifar-10 and cifar100 datasets. 2009. URL https: //www.cs.toronto.edu/kriz/cifar.html

2009
[24]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. InNeurIPS, 2012. 11

2012
[25]

Tiny imagenet visual recognition challenge

Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. 2015. URL https: //api.semanticscholar.org/CorpusID:16664790

2015
[26]

& Haffner, P

Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

work page doi:10.1109/5.726791 1998
[27]

Diversity-enhanced distribution alignment for dataset distillation

Hongcheng Li, Yucan Zhou, Xiaoyan Gu, Bo Li, and Weiping Wang. Diversity-enhanced distribution alignment for dataset distillation. InICCV, 2025

2025
[28]

Darts: Differentiable architecture search

Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In ICLR, 2019

2019
[29]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InICCV, 2021

2021
[30]

Algebra unveils deep learning an invitation to neuroalgebraic geometry

Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, and Kathlen Kohn. Algebra unveils deep learning an invitation to neuroalgebraic geometry. InICML, 2025

2025
[31]

Mesner and Prabir Bhattacharya

Dale M. Mesner and Prabir Bhattacharya. Association schemes on triples and a ternary algebra. JOURNAL OF COMBINATORIAL THEORY, 1988

1988
[32]

Ibomoiye Domor Mienye and Theo G. Swart. A comprehensive review of deep learning: Architectures, recent advances, and applications. InInformation, 2024

2024
[33]

Spiralmlp: A lightweight vision mlp architecture

Haojie Mu, Burhan Ul Tayyab, and Nicholas Chua. Spiralmlp: A lightweight vision mlp architecture. InWACV, 2024

2024
[34]

Folding over neural networks, 2022

Minh Nguyen and Nicholas Wu. Folding over neural networks, 2022

2022
[35]

Deep tree tensor networks for image recognition

Chang Nie, Junfang Chen, and Yajie Chen. Deep tree tensor networks for image recognition. In NeurIPS, 2025

2025
[36]

A survey on state-of-the-art deep learning applications and challenges

Mohd Halim Mohd Noor and Ayokunle Olalekan Ige. A survey on state-of-the-art deep learning applications and challenges. InEngineering Applications of Artificial Intelligence, 2025

2025
[37]

Automatic differentiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. InNIPS-W, 2017

2017
[38]

Learning representations by back- propagating errors., 1986

David Rumelhart, Geoffrey Hinton, and Ronald Williams. Learning representations by back- propagating errors., 1986

1986
[39]

Sandler, A

M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. InCVPR, 2018

2018
[40]

Learning on a razor’s edge: the singularity bias of polynomial neural networks

Vahid Shahverdi, Giovanni Luca Marchetti, and Kathlen Kohn. Learning on a razor’s edge: the singularity bias of polynomial neural networks. InICLR, 2026

2026
[41]

Simonyan and A

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. InICLR, 2015

2015
[42]

Smith and Nicholay Topin

Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates, 2017

2017
[43]

Logit standardization in knowledge distillation

Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, and Xiaochun Cao. Logit standardization in knowledge distillation. InCVPR, 2024

2024
[44]

Pure and spurious critical points: A geometric study of linear networks

Matthew Trager, Kathlen Kohn, and Joan Bruna. Pure and spurious critical points: A geometric study of linear networks. InICLR, 2020

2020
[45]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeurIPS, 2017

2017
[46]

Equivariant and coordinate independent convolutional networks, 2023

Maurice Weiler, Patrick Forré, Erik Verlinde, and Max Welling. Equivariant and coordinate independent convolutional networks, 2023. 12

2023
[47]

Nas-bench-101: Towards reproducible neural architecture search

Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, and Frank Hutter. Nas-bench-101: Towards reproducible neural architecture search. InICML, 2019

2019
[48]

Arsiwalla, and Taliesin Beynon

Carlos Zapata-Carratalá, Xerxes D. Arsiwalla, and Taliesin Beynon. Diagrammatic calculus and generalized associativity for higher-arity tensor operations.Theoretical Computer Science, 2024

2024
[49]

list distances

Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In ICML, 2024. 13 Table of Contents ATheoretical Appendix. A.1Generalized Tensors. A.1.1 Technical Background. A.1.2 Complete definition of the slice ordering compatibility condit...

2024
[50]

A partition ofE,P∈ P(P(E)). 17
[51]

one index away

An edge-indexed family of strict weak orders, <e ⊊e×e e∈E. A PWO-HG without the partition is called aweakly ordered hypergraph. Intuitively, weakly ordered hypergraphs are hypergraphs which have lists for edges. As such, they are a strict generalization of the well-studied interval hypergraphs (see [4], section 4.1). We recall that an interval hypergraph ...
[52]

For any pathγ, X p∈P dp(γ) =d(γ)
[53]

For any pathγ, d p(γ) =−d p(¯γ)
[54]

elements

For any pair of composable pathsγ a,b, γb,c, d p(γb,c ◦γ a,b) =d p(γb,c) +d p(γa,b) Proof. These identities follow immediately from the definition of p-distance and the fact that P is a partition ofE. As PWO-HGs will be used to study hyper-tensors, we now introduce the notion of a path between tuples. Definition A.9.Atupleof a PWO-HG H=⟨V, E,{< e}e∈E, P⟩ ...
[55]

The2-cells ofs 3 form a partition of the⊊-maximal elements ofX 1|s3
[56]

∀x∈ X 0|s3 ,∃{s 1 i }i∈I ⊏s 3 withx∈ |s3|\ i=1 s1 i

Eachx∈ X 0|s3 is contained in the intersection of some transversal of this partition. ∀x∈ X 0|s3 ,∃{s 1 i }i∈I ⊏s 3 withx∈ |s3|\ i=1 s1 i
[57]

essentially

The structure H= X 0|s3 ,X 1|s3 ,X 2|s3 forms a partitioned and weakly ordered hyper- graph which satisfies theslice ordering compatibility conditions: (a)His connected. (b) All tuple cycles onHhave zerop-distance, for eachp∈s 3. 19 The only difference from Definition 4.1 is the specification of condition 1 to the subset-maximal elements of s3, which is n...
[58]

Whenevera, b∈ X 0|s3 1 satisfy:d(γ a,b) = 0, add[{a, b}]to eachs 2 ∈s 3
[59]

degenerate

Whenevera, b∈ X 0|s3 1 satisfy:|d(γ a,b)|= 1, add[a, b]tos 2 ifd s2(γa,b) = 1, otherwise, add[b, a]tos 2, wheres 2 is the2-cell with|d s2(γa,b)|= 1. The generalized tensor s3 2 produced by this construction is maximal because the added 1-cells ensure the required implications. Furthermore, as none of the added 1-cells introduce cycles of non-zero distance...
[60]

We can continue this process until we reach some i such that ds1(γxi,xi+1) = 1 and xi ∈s 1 1 and xi+1 ∈s 1

Analogously, x2 ∈s 1 1 and xn−1 ∈s 1 2, and so on. We can continue this process until we reach some i such that ds1(γxi,xi+1) = 1 and xi ∈s 1 1 and xi+1 ∈s 1
[61]

efficient

By the previous argument, this is impossible. We conclude thats 3 c is unique up to mode permutations. Canonical representatives are useful for reducing the rank complexity of the HCC encodings of generalized tensors. 23 Vertices Base setX 0 Edges X 1 ⊂ P(X 0) Ordered Pairs X 2 ⊂ P(X 1) Ordered Edges X 3 ⊂ P(X 2) Parts X 4 ⊂ P(X 3) PWO-HGs X 5 ⊂ P(X 4) El...
[62]

minor technical update

For each transversal{s 1 i }i∈I ⊏s 3,we have that |s3|\ i=1 s1 i ≤1 thens 3 can be represented as an injective multidimensional array. The “minor technical update” as compared to Theorem 4.2 is the restriction of hypotheses 1 to the canonical representative for s3. This is necessary because hypothesis 1 (theequal lengthcondition) is meaningless in the pre...
[63]

exterior corner

Each{v} ∈ X 0 is contained in a transversal of this supposed partition. 3.X 1 satisfies the slice ordering compatibility conditions. To show 1, let s1 ∈ X 1 be a ⊊-maximal 1-cell. By construction, s1 is a set of v∈V which corresponds under T to a set of multi-indices with O−1 of the entries fixed. Let i denote the free index. As X 2 was constructed such t...
[64]

That is, pick zi = argmaxz minx∈X 0|Ti d(z, x)

Let Hi be the PWO-HG (Definition A.4) associated to Ti, and pick an origin zi ∈ X 0|Ti to maximize the minimum distance path from zi. That is, pick zi = argmaxz minx∈X 0|Ti d(z, x)
[65]

Construct multi-indices for each tuple ofH i with Lemma A.13, setting the origin toz i
[66]

Call the resulting set of multi-indices Ii ⊆[[M i,1]]×...×[[M i,Oi]], whereO i =|X 2|Ti |andM i,j is the tensor length of thej th mode ofT i

If necessary, introduce offsets to these indices to ensure that no multi-index contains any non-positive entries. Call the resulting set of multi-indices Ii ⊆[[M i,1]]×...×[[M i,Oi]], whereO i =|X 2|Ti |andM i,j is the tensor length of thej th mode ofT i. 35
[67]

Otherwise, assign ∅

Define a Vi-valued multidimensional array Ti by assigning to each multi-index of Ii the corresponding tuple if one exists. Otherwise, assign ∅. By the SOCCs, this is a well-defined function
[68]

tensor operation

Collapse the tuples of Ti to their products with the rule: t∈image(T i)7→ Q x∈t x, whereQ is shorthand for iterated application of⋆. This process produces a multidimensional array Ti : [[Mi,1]]×...×[[M i,Oi]]→V i for each tensor Ti. While these multidimensional arrays may be jagged, they are by construction not hyper. Step 2.Denote by M the set of all mod...
[69]

Before we discuss the somewhat technical proof, we provide a fundamental corollary as further motivation

Each operand ofs 4 can be represented as an injective multidimensional array 2.⋆distributes over⋄ 3.⋄is associative then, after evaluation using ⋆ and ⋄, s4 is equal to the composition of an (α−1) -arity tensor operation and a binary tensor operation. Before we discuss the somewhat technical proof, we provide a fundamental corollary as further motivation....
[70]

index" and “mode

To facilitate this process, we now perform some index management: (i1, ..., iσ)— non-contracted indices ofA (iσ+1, ..., iCO)— contracted indices ofA (j1, ..., jρ)— non-contracted indices ofT β froms 4 1 (jρ+1, ..., jOβ)— contracted indices ofT β froms 4 1 (k1, ..., kϱ)— non-contracted indices ofT α ins 4 2 (kϱ+1, ..., kOα)— contracted indices ofT α ins 4 ...
[71]

Figure 16: Index map

∆ is non-negative because we only (possibly) removed contractions in the creation ofs 4 1. Figure 16: Index map. Dotted boxes are contracted modes. Next, observe that for each contrac- tion removed in the creation of s4 1, there must exist some k index which is part of the corresponding coupling. This is because of how C1 is defined. Dually, for each cont...
[72]

The complete system of index equivalences is given below: (i1, ..., im) = (k1, ..., kϱ) (j1, ..., jm−n) = (kn, ..., kϱ) (j1, ..., jρ) = (in, ..., iσ+∆) (iσ+1, ..., iσ+∆) = (kϱ+1, ..., kϱ+∆) (jρ+1, ..., jOβ) = (iσ+∆+1, ..., iCO−p) Let [[M ′ n]]×..×[[M ′ n+ρ]] be the multi-index set defined by the non-contracted modes of A1, where A1 is the hyper-tensor def...
[73]

α−1Y a=1 A1[j1 :j Oβ] [a] # = X (jρ+1:jOβ )

From here on, we write (i1 :i n) as shorthand for (i1, ..., in). We can now evaluates4 1: Tβ : [[M ′ n]]×..×[[M ′ n+ρ]]→F (j1 :j ρ) = (in :i σ+∆)7→ X (jρ+1:jOβ ) " α−1Y a=1 A1[j1 :j Oβ] [a] # = X (jρ+1:jOβ ) " α−1Y a=1 A[∅=i 1 :i n−1, j1 :j Oβ , iσ+∆+1 :i CO =∅] [a] # (5) Here we have used an empty index to indicate broadcasting, i.e., indexing into dummy...
[74]

Then, s4 1 and s4 2 can be combined into an equivalent tensor operation of arityα+α ′ −1. Proof. These hypotheses are simply a restatement of the properties satisfied by the tensor operations s4 1 and s4 2 constructed in the proof of Theorem A.35. Therefore, they are equivalent to the tensor operation defined by: s4 :={T∈s 4 1 :Tis a tensor} ∪ {T∈s 4 2 \T...
[75]

operand factored

The tensor lengths align by assumption. 41 A.2.4 Additional Examples of Tensor Operations We now provide more examples of tensor operations to facilitate understanding of the main paper. Throughout this section, we assume that F=R , and that the base operations of all tensor operations are real number multiplication and addition. Jagged Matrix Multiplicat...
[76]

compli- cated

use compatible base operations with the linear transformations of the CNM. It is interesting to observe that the copresheaf framework is effectively a categorical language for the description of TEMs similar to that of Figure 22. An important detail is that the transformations that can be implemented in a CNM are strictly linear, meaning that in particula...

2017