The Symmetries of Three-Layer ReLU Networks

Guido Mont\'ufar; Johanna Marie Gegenfurtner; Moritz Grillo

arxiv: 2605.18319 · v1 · pith:IAYCOYYWnew · submitted 2026-05-18 · 💻 cs.LG · cs.DM· math.AG· math.CO

The Symmetries of Three-Layer ReLU Networks

Johanna Marie Gegenfurtner , Moritz Grillo , Guido Mont\'ufar This is my paper

Pith reviewed 2026-05-20 12:08 UTC · model grok-4.3

classification 💻 cs.LG cs.DMmath.AGmath.CO

keywords ReLU networksparameter symmetriesfunctional equivalencesemi-algebraic setsbottleneck architecturesthree-layer networksgradient flow

0 comments

The pith

Three-layer bottleneck ReLU networks have their generic parameter fibers completely characterized by explicit semi-algebraic descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for analyzing parameter symmetries in deep ReLU networks. It focuses on three-layer bottleneck architectures to give a full description of the generic parameter fibers using semi-algebraic sets. This yields a polynomial-time algorithm for determining whether two sets of parameters define the same function. The symmetries come from how layers compose and whether they hide or keep geometric information. Certain symmetries create local conservation laws in the gradient flow, while others do not.

Core claim

We obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures, providing explicit semi-algebraic descriptions of these fibers and a polynomial time algorithm for deciding functional equivalence of two parameters. The symmetries include discrete and continuous transformations arising from layer composition, and depend on whether deeper layers hide or preserve geometric structure from preceding layers. Some of these symmetries induce local conservation laws along gradient flow, while others do not.

What carries the argument

Semi-algebraic descriptions of the generic parameter fibers, which capture the discrete and continuous symmetries induced by layer composition in three-layer ReLU bottleneck networks.

If this is right

Functional equivalence of two parameter sets can be decided in polynomial time.
Symmetries arise as discrete permutations or continuous transformations depending on whether layers hide or preserve preceding geometric structure.
Some symmetries produce local conservation laws along gradient flow while others do not.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The semi-algebraic approach could be tested on small explicit examples to verify the fiber descriptions before scaling.
Conservation laws might translate into invariant quantities that constrain training trajectories in these architectures.
The framework opens a route to study how symmetries change when moving from bottleneck to wider three-layer nets.

Load-bearing premise

The characterization holds for generic parameters in the three-layer bottleneck regime without extra degeneracies from specific data or initialization.

What would settle it

Two parameters that realize the same function but lie outside the described semi-algebraic fiber for a generic three-layer bottleneck ReLU network would falsify the complete characterization.

Figures

Figures reproduced from arXiv: 2605.18319 by Guido Mont\'ufar, Johanna Marie Gegenfurtner, Moritz Grillo.

**Figure 1.** Figure 1: Left: Illustration of Theorem 2. The orange hyperplanes correspond to first-layer neurons, and the cyan bent hyperplane to a second-layer neuron. The portions of the first-layer hyperplanes that intersect the active region of the second-layer neuron are highlighted in green. Right: Illustration of sign-flipping mechanism. The region R is shown in yellow. Hyperplanes of nonlinear neurons that are active on … view at source ↗

**Figure 2.** Figure 2: Parameters in the fiber when there is a hidden hyperplane. Two first-layer hyperplanes are [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

We develop a framework for analyzing parameter symmetries in deep ReLU networks and obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. Our approach provides explicit semi-algebraic descriptions of these fibers and yields a polynomial time algorithm for deciding functional equivalence of two parameters. The symmetries include discrete and continuous transformations arising from layer composition, and depend on whether deeper layers hide or preserve geometric structure from preceding layers. Finally, we show that some of these symmetries induce local conservation laws along gradient flow, while others do not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a framework for analyzing parameter symmetries in deep ReLU networks and claims a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. It provides explicit semi-algebraic descriptions of these fibers and a polynomial-time algorithm for deciding functional equivalence of two parameters. Symmetries arise from layer composition and geometric preservation properties, with some inducing local conservation laws along gradient flow.

Significance. If the derivations hold, the explicit semi-algebraic fiber descriptions and polynomial-time equivalence test would be a solid contribution to neural network identifiability and loss-landscape analysis. The distinction between symmetries that preserve versus hide geometric structure is a useful organizing principle, and the link to conservation laws along gradient flow is a nice bridge to dynamics. No machine-checked proofs or reproducible code are mentioned, but the mathematical framing is self-contained.

major comments (2)

[Abstract and §3] Abstract and §3 (Main Results): The polynomial-time algorithm is stated to decide functional equivalence for arbitrary parameter vectors, yet the semi-algebraic fiber description is supplied only for generic points. No separate case analysis or reduction is indicated for degenerate loci (e.g., linearly dependent rows in the bottleneck matrix that produce identical ReLU activation patterns), which could introduce additional discrete or continuous symmetries not captured by the generic description. This directly affects correctness of the decision procedure.
[§4.2] §4.2 (Algorithm): It is unclear whether the equivalence test first checks membership in the generic stratum or applies a preliminary degeneracy detector; without an explicit handling step whose complexity remains polynomial, the claimed polynomial runtime may not extend to all inputs.

minor comments (2)

[§3] Notation for the semi-algebraic sets could be clarified with an explicit list of the defining polynomials and inequalities in one place.
[Figures] Figure 2 (or equivalent) illustrating a non-generic fiber would help readers see the difference from the generic case.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript on symmetries in three-layer ReLU networks. The comments correctly highlight the need to clarify how our results on generic fibers extend to the polynomial-time equivalence test for arbitrary parameters. We address each point below and will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Main Results): The polynomial-time algorithm is stated to decide functional equivalence for arbitrary parameter vectors, yet the semi-algebraic fiber description is supplied only for generic points. No separate case analysis or reduction is indicated for degenerate loci (e.g., linearly dependent rows in the bottleneck matrix that produce identical ReLU activation patterns), which could introduce additional discrete or continuous symmetries not captured by the generic description. This directly affects correctness of the decision procedure.

Authors: We agree that the explicit semi-algebraic fiber descriptions are stated for generic parameters, as emphasized in the title, abstract, and §3. The polynomial-time equivalence test, however, is designed to apply to arbitrary inputs by first detecting degeneracy through rank computations on the weight matrices (e.g., checking linear dependence of rows in the bottleneck layer), which can be performed in polynomial time. Upon detecting a degenerate locus, the test reduces the instance to an equivalent lower-dimensional generic case by merging redundant neurons or adjusts the invariants accordingly, while still correctly deciding functional equivalence. This reduction preserves correctness because functional equivalence is determined by the induced network function, which the invariants capture even in degenerate strata. We will add a dedicated paragraph and case analysis in §3 to make the reduction explicit. revision: yes
Referee: [§4.2] §4.2 (Algorithm): It is unclear whether the equivalence test first checks membership in the generic stratum or applies a preliminary degeneracy detector; without an explicit handling step whose complexity remains polynomial, the claimed polynomial runtime may not extend to all inputs.

Authors: The referee is right that §4.2 does not currently spell out the preliminary degeneracy detector. The algorithm begins with a polynomial-time check (matrix rank computations in O(d^3) time for dimension d) to classify the input as generic or degenerate. For degenerate inputs it invokes a separate but still polynomial branch that accounts for extra symmetries by collapsing equivalent activation patterns before applying the core invariants. We will revise the pseudocode, add explicit complexity analysis for the detector, and include a short proof sketch that the overall procedure remains polynomial for all inputs. revision: yes

Circularity Check

0 steps flagged

No circularity: symmetries derived from layer composition and geometry

full rationale

The paper's central results consist of explicit semi-algebraic descriptions of generic parameter fibers together with a polynomial-time equivalence test. These are obtained by analyzing how ReLU activations and linear maps compose across layers while preserving or hiding geometric structure. No equation or algorithm reduces by construction to a quantity defined in terms of the final fiber description itself, nor does any load-bearing step rely on a self-citation whose content is merely the present claim restated. The derivation remains self-contained against the stated assumptions on generic parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework relies on standard algebraic geometry facts about semi-algebraic sets and properties of ReLU activations; no free parameters are introduced and no new entities are postulated.

axioms (2)

domain assumption ReLU networks admit parameter transformations arising from layer composition that preserve the realized function.
Invoked in the abstract when describing symmetries from layer composition.
domain assumption Generic parameters in the three-layer bottleneck architecture have fibers that admit explicit semi-algebraic descriptions.
Central modeling choice stated in the abstract for the complete characterization.

pith-pipeline@v0.9.0 · 5618 in / 1355 out tokens · 27460 ms · 2026-05-20T12:08:01.681269+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We develop a framework for analyzing parameter symmetries in deep ReLU networks and obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. Our approach provides explicit semi-algebraic descriptions of these fibers...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The fiber of a parameter θ∈ΘA is S(θ):={η∈ΘA | fη=fθ}. ... permutation of neurons ... positive rescaling ...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel LK Yamins, and Hidenori Tanaka

URLhttps://arxiv.org/abs/2104.05508. Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel LK Yamins, and Hidenori Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In International Conference on Learning Representations, 2021. URL https://openreview.net/ forum?id=q8qLAbQBupm. Bo Zhao, Iordan Ganev, Robin Walt...

work page doi:10.7557/18.5150 2021
[2]

URLhttps://proceedings.mlr.press/v291/froese25a.html. 12 Appendix Contents A Background on Polyhedral Geometry 13 B Polyhedral Geometry of ReLU Networks 14 C Layerwise Fibers 18 D Fibers from Layer Composition 25 E Three-Layer Bottleneck Fibers 28 F Implications for Deeper Networks 34 G Discussion: Beyond the Generic Bottleneck Regime 36 A Background on P...

work page
[3]

ifP∈ C, then all faces ofPare inC, and

work page
[4]

For a polyhedral complex C in Rd and k≤d we denote by Ck the set of k-dimensional polyhedra in C

ifP, P ′ ∈ CandP∩P ′ ̸=∅, thenP∩P ′ is a face of bothPandP ′. For a polyhedral complex C in Rd and k≤d we denote by Ck the set of k-dimensional polyhedra in C. We call Cd theregions, Cd−1 thefacets, and Cd−2 theridgesof C. Thedimensionof a complex C is the maximal dimension of its polyhedra. Given a faceσ∈ C, we denote byaff(σ)⊆R d the unique smallest aff...

work page 2015
[5]

Supertransversality: every faceτ∈ C d−k θ is contained in exactlykbent hyperplanes

work page
[6]

We denote the set of generic parameters by ˜ΘA ⊆Θ A

Maximal rank of masked products: for every ℓ∈[L+ 1] and every sequence of index sets Si ⊆[n i] for i≤ℓ−1 , the matrix product W (ℓ)DSℓ−1 W (ℓ−1) · · ·D S1 W (1) attains the maximum possible rank, namelymin d, n ℓ,min 1≤i≤ℓ−1 |Si| . We denote the set of generic parameters by ˜ΘA ⊆Θ A. Lemma 20.The set ˜ΘA is open and dense inΘ A. B.3 Breakpoint Complexes a...

work page 2026
[7]

In particular,H∩R= aff(σ 1)∩R

If # starBθ(τ) d−1 = 4, then there are exactly two facetsσ1, σ2 ∈star Bθ(τ) d−1 such that eσ1/τ = −eσ2/τ ,and these are precisely the facets contained inH. In particular,H∩R= aff(σ 1)∩R

work page
[8]

In particular, H∩R= aff(σ)∩R

If # starBθ(τ) d−1 = 3 , then there is a unique facet σ∈star Bθ(τ) d−1 that is not adjacent to a region on which fθ is constant, and this facet is the one contained in H. In particular, H∩R= aff(σ)∩R. B.6 Visibility of First-Layer Hyperplanes in Three-Layer Networks We now prove the visibility result stated as Theorem 2 in the main text. The key point is ...

work page
[9]

If H∈ H 1(θ) and there exists a neuron j∈[n 2] together with a point x∈H such that z(2,θ) j (x)>0, thenH∈ H 1(η)

work page
[10]

Proof of Theorem 2

In particular, if this holds for everyH∈ H 1(θ), thenH 1(θ) =H 1(η)andC θ =C η. Proof of Theorem 2. Let H∈ H 1(θ). By assumption, there exists x∈H such that z(2,θ) j (x)>0 . We consider two cases based on the behavior ofz (2,θ) j onH: Case 1:There also exists x′ ∈H such that z(2,θ) j (x′)<0 . By continuity, there must be a point on H where z(2,θ) j = 0, m...

work page
[11]

Breakpoint Structure: Nonlinear neurons align with hyperplanes according to their assignment and orientation. • For visible neurons, the parameters are fixed as: (W (1) i , b(1) i ) =o(i)(a ϕ(i), tϕ(i))∀i∈N vis.(4) • For canceling neurons, any two in the same group, j, i∈ϕ −1(ℓ), share the same hyperplane and are oriented according tooas: (W (1) i , b(1) ...

work page
[12]

, k}, 0ifj∈ {k+ 1,

Weight Consistency: The output weights for each neuron group j add to the corresponding weight: X i∈ϕ−1(j) o(i)W (2) :,i = c(j)ifj∈ {1, . . . , k}, 0ifj∈ {k+ 1, . . . , n}. (7)

work page
[13]

Affine Base Map: The neurons active on the reference regionR must match (AR, bR). A nonlinear neuroni∈N vis ∪N can is active onRif and only ifo(i) =−1: X i∈Nlin W (2) :,i W (1) i + X i∈Nvis∪Ncan o(i)=−1 W (2) :,i W (1) i =A R,(8) X i∈Nlin W (2) :,i b(1) i + X i∈Nvis∪Ncan o(i)=−1 W (2) :,i b(1) i +b (2) =b R.(9) We denote byS(θ) ∼the fiber modulo positive ...

work page
[14]

Thus k=|N vis|and the assignment mapϕrestricted toN vis is a permutation

Every nonlinear neuron i∈N vis corresponds to a unique hyperplane Hj ∈ H . Thus k=|N vis|and the assignment mapϕrestricted toN vis is a permutation

work page
[15]

ThusN can =∅

There are no canceling neurons. ThusN can =∅. Next we prove our result describing the generic fiber of a one-hidden-layer network. Proposition 5.Let θ be a generic and minimal parameter. Then the generic fiber is given by the disjoint union ˜S(θ) ∼= S S⊆[k] KS ∩ ˜ΘA ∼, where the union is taken over orientations S⊆[k], andK S ={p S} ×V S is the semi-algebr...

work page
[16]

Ifn−k <min{d, m}, thendim(V S) = (n−k) 2 if and only ifS=S θ, andV S =∅otherwise

work page
[17]

Ifn−k=d≤m, thendim(V S) = (n−k) 2 for allS⊆[k]

work page
[18]

If n−k=m < d , then dim(VS) = (n−k) 2 if and only if cone(AS)⊆col(A S) is pointed, and VS =∅otherwise, whereA S is the compensation defined in(1)

work page
[19]

Proof of Theorem 6.LetJ= [n]\[k]

Ifn−k=m+ 1≤d, thendim(V S) =m 2 +m+dfor allS⊆[k]. Proof of Theorem 6.LetJ= [n]\[k]. For alli∈[k]we have that c(i)a⊤ i =o θ(i)W (2,θ) :,i W (1,θ) i .(10) Recall that on the reference regionRwe have that fη(x) =A Rx+b R for all η∈ S(θ). Further, we have that AR, bR split into contributions from the linear and non-linear neurons. For θ, let oθ(i) be the hype...

work page
[20]

Ify >0andy+t >0, thenz (2,η) i (x)−z (2,θ) i (x) = 0

work page
[21]

Moreover, it holds that z(2,η) i (x)≤W (2) ij [y+t] + +b (2,η) i =b (2,η) i =b (2) i −tW (2) ij ≤0

Ify >0andy+t≤0, thent <0andz (2,η) i (x)≥z (2,θ) i (x). Moreover, it holds that z(2,η) i (x)≤W (2) ij [y+t] + +b (2,η) i =b (2,η) i =b (2) i −tW (2) ij ≤0. For the first inequality we used W (2) ik ≤0 for all k̸=j ; for the second equality we used y+t≤0; and for the last inequality we usedb (2) i ≤0,W (2) ij >0, andt≤0. 25

work page
[22]

Moreover, it holds that z(2,θ) i (x)≤W (2) ij [y]+ +b (2) i =b (2) i ≤0

Ify≤0andy+t >0, thent >0andz (2,η) i (x)≤z (2,θ) i (x). Moreover, it holds that z(2,θ) i (x)≤W (2) ij [y]+ +b (2) i =b (2) i ≤0

work page
[23]

In all four cases, a(2,η) i (x) =a (2,θ) i (x), and thus η∈ ˜S(θ)

Ify≤0andy+t≤0, then we have thatz (2,θ) i (x)≤0as well asz (2,η) i (x)≤0. In all four cases, a(2,η) i (x) =a (2,θ) i (x), and thus η∈ ˜S(θ). Since η was arbitrary in Tj(θ), it follows thatT j(θ)⊆ ˜S(θ), as claimed. The Discrete Symmetry The group Gj is generated by the singleton sets {k} for k∈ {1, . . . , n 1} \ {j} , i.e., Gj = M{k} : k∈ {1, . . . , n 1...

work page
[24]

For every subsetI⊆[n 1], there exists a unique regionP∈ C d θ,1 such thatS 1(P) =I

work page
[25]

The statements follow from established results on hyperplane arrangements (see, e.g., Orlik and Terao, 1992, Chapter 2)

For every pair of disjoint subsets I, Z⊆[n 1], there exists a non-empty face σ∈ C θ,1 such that σ⊆ T k∈Z Hk andS 1(σ) =I Proof. The statements follow from established results on hyperplane arrangements (see, e.g., Orlik and Terao, 1992, Chapter 2). We offer a proof for completeness. LetHk ={x∈R d |z (1,θ) k (x) = 0} fork∈[n 1]. Sinceθis generic, the first...

work page 1992
[26]

If P∩B j(θ)̸=∅ , then the facet P∩B j(θ) is parallel to a first-layer hyperplaneH i ∈ H 1(θ)if and only ifS 1(P) ={i}

Let P∈ C d θ,1 and let j∈[n 2]. If P∩B j(θ)̸=∅ , then the facet P∩B j(θ) is parallel to a first-layer hyperplaneH i ∈ H 1(θ)if and only ifS 1(P) ={i}

work page
[27]

linear evidence

Let σ1, σ2 ∈ C d−1 θ be second-layer facets. Then σ1 and σ2 are parallel if and only if there exist i∈[n 1]and a regionP∈ C d θ,1 such thatS 1(P) ={i}and bothσ 1 andσ 2 are contained inP. 28 Proof. On a fixed region P∈ C d θ,1, the first-layer activation pattern is constant. Hence each second-layer preactivation z(2,θ) j restricts to an affine-linear func...

work page 2026

[1] [1]

Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel LK Yamins, and Hidenori Tanaka

URLhttps://arxiv.org/abs/2104.05508. Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel LK Yamins, and Hidenori Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In International Conference on Learning Representations, 2021. URL https://openreview.net/ forum?id=q8qLAbQBupm. Bo Zhao, Iordan Ganev, Robin Walt...

work page doi:10.7557/18.5150 2021

[2] [2]

URLhttps://proceedings.mlr.press/v291/froese25a.html. 12 Appendix Contents A Background on Polyhedral Geometry 13 B Polyhedral Geometry of ReLU Networks 14 C Layerwise Fibers 18 D Fibers from Layer Composition 25 E Three-Layer Bottleneck Fibers 28 F Implications for Deeper Networks 34 G Discussion: Beyond the Generic Bottleneck Regime 36 A Background on P...

work page

[3] [3]

ifP∈ C, then all faces ofPare inC, and

work page

[4] [4]

For a polyhedral complex C in Rd and k≤d we denote by Ck the set of k-dimensional polyhedra in C

ifP, P ′ ∈ CandP∩P ′ ̸=∅, thenP∩P ′ is a face of bothPandP ′. For a polyhedral complex C in Rd and k≤d we denote by Ck the set of k-dimensional polyhedra in C. We call Cd theregions, Cd−1 thefacets, and Cd−2 theridgesof C. Thedimensionof a complex C is the maximal dimension of its polyhedra. Given a faceσ∈ C, we denote byaff(σ)⊆R d the unique smallest aff...

work page 2015

[5] [5]

Supertransversality: every faceτ∈ C d−k θ is contained in exactlykbent hyperplanes

work page

[6] [6]

We denote the set of generic parameters by ˜ΘA ⊆Θ A

Maximal rank of masked products: for every ℓ∈[L+ 1] and every sequence of index sets Si ⊆[n i] for i≤ℓ−1 , the matrix product W (ℓ)DSℓ−1 W (ℓ−1) · · ·D S1 W (1) attains the maximum possible rank, namelymin d, n ℓ,min 1≤i≤ℓ−1 |Si| . We denote the set of generic parameters by ˜ΘA ⊆Θ A. Lemma 20.The set ˜ΘA is open and dense inΘ A. B.3 Breakpoint Complexes a...

work page 2026

[7] [7]

In particular,H∩R= aff(σ 1)∩R

If # starBθ(τ) d−1 = 4, then there are exactly two facetsσ1, σ2 ∈star Bθ(τ) d−1 such that eσ1/τ = −eσ2/τ ,and these are precisely the facets contained inH. In particular,H∩R= aff(σ 1)∩R

work page

[8] [8]

In particular, H∩R= aff(σ)∩R

If # starBθ(τ) d−1 = 3 , then there is a unique facet σ∈star Bθ(τ) d−1 that is not adjacent to a region on which fθ is constant, and this facet is the one contained in H. In particular, H∩R= aff(σ)∩R. B.6 Visibility of First-Layer Hyperplanes in Three-Layer Networks We now prove the visibility result stated as Theorem 2 in the main text. The key point is ...

work page

[9] [9]

If H∈ H 1(θ) and there exists a neuron j∈[n 2] together with a point x∈H such that z(2,θ) j (x)>0, thenH∈ H 1(η)

work page

[10] [10]

Proof of Theorem 2

In particular, if this holds for everyH∈ H 1(θ), thenH 1(θ) =H 1(η)andC θ =C η. Proof of Theorem 2. Let H∈ H 1(θ). By assumption, there exists x∈H such that z(2,θ) j (x)>0 . We consider two cases based on the behavior ofz (2,θ) j onH: Case 1:There also exists x′ ∈H such that z(2,θ) j (x′)<0 . By continuity, there must be a point on H where z(2,θ) j = 0, m...

work page

[11] [11]

Breakpoint Structure: Nonlinear neurons align with hyperplanes according to their assignment and orientation. • For visible neurons, the parameters are fixed as: (W (1) i , b(1) i ) =o(i)(a ϕ(i), tϕ(i))∀i∈N vis.(4) • For canceling neurons, any two in the same group, j, i∈ϕ −1(ℓ), share the same hyperplane and are oriented according tooas: (W (1) i , b(1) ...

work page

[12] [12]

, k}, 0ifj∈ {k+ 1,

Weight Consistency: The output weights for each neuron group j add to the corresponding weight: X i∈ϕ−1(j) o(i)W (2) :,i = c(j)ifj∈ {1, . . . , k}, 0ifj∈ {k+ 1, . . . , n}. (7)

work page

[13] [13]

Affine Base Map: The neurons active on the reference regionR must match (AR, bR). A nonlinear neuroni∈N vis ∪N can is active onRif and only ifo(i) =−1: X i∈Nlin W (2) :,i W (1) i + X i∈Nvis∪Ncan o(i)=−1 W (2) :,i W (1) i =A R,(8) X i∈Nlin W (2) :,i b(1) i + X i∈Nvis∪Ncan o(i)=−1 W (2) :,i b(1) i +b (2) =b R.(9) We denote byS(θ) ∼the fiber modulo positive ...

work page

[14] [14]

Thus k=|N vis|and the assignment mapϕrestricted toN vis is a permutation

Every nonlinear neuron i∈N vis corresponds to a unique hyperplane Hj ∈ H . Thus k=|N vis|and the assignment mapϕrestricted toN vis is a permutation

work page

[15] [15]

ThusN can =∅

There are no canceling neurons. ThusN can =∅. Next we prove our result describing the generic fiber of a one-hidden-layer network. Proposition 5.Let θ be a generic and minimal parameter. Then the generic fiber is given by the disjoint union ˜S(θ) ∼= S S⊆[k] KS ∩ ˜ΘA ∼, where the union is taken over orientations S⊆[k], andK S ={p S} ×V S is the semi-algebr...

work page

[16] [16]

Ifn−k <min{d, m}, thendim(V S) = (n−k) 2 if and only ifS=S θ, andV S =∅otherwise

work page

[17] [17]

Ifn−k=d≤m, thendim(V S) = (n−k) 2 for allS⊆[k]

work page

[18] [18]

If n−k=m < d , then dim(VS) = (n−k) 2 if and only if cone(AS)⊆col(A S) is pointed, and VS =∅otherwise, whereA S is the compensation defined in(1)

work page

[19] [19]

Proof of Theorem 6.LetJ= [n]\[k]

Ifn−k=m+ 1≤d, thendim(V S) =m 2 +m+dfor allS⊆[k]. Proof of Theorem 6.LetJ= [n]\[k]. For alli∈[k]we have that c(i)a⊤ i =o θ(i)W (2,θ) :,i W (1,θ) i .(10) Recall that on the reference regionRwe have that fη(x) =A Rx+b R for all η∈ S(θ). Further, we have that AR, bR split into contributions from the linear and non-linear neurons. For θ, let oθ(i) be the hype...

work page

[20] [20]

Ify >0andy+t >0, thenz (2,η) i (x)−z (2,θ) i (x) = 0

work page

[21] [21]

Moreover, it holds that z(2,η) i (x)≤W (2) ij [y+t] + +b (2,η) i =b (2,η) i =b (2) i −tW (2) ij ≤0

Ify >0andy+t≤0, thent <0andz (2,η) i (x)≥z (2,θ) i (x). Moreover, it holds that z(2,η) i (x)≤W (2) ij [y+t] + +b (2,η) i =b (2,η) i =b (2) i −tW (2) ij ≤0. For the first inequality we used W (2) ik ≤0 for all k̸=j ; for the second equality we used y+t≤0; and for the last inequality we usedb (2) i ≤0,W (2) ij >0, andt≤0. 25

work page

[22] [22]

Moreover, it holds that z(2,θ) i (x)≤W (2) ij [y]+ +b (2) i =b (2) i ≤0

Ify≤0andy+t >0, thent >0andz (2,η) i (x)≤z (2,θ) i (x). Moreover, it holds that z(2,θ) i (x)≤W (2) ij [y]+ +b (2) i =b (2) i ≤0

work page

[23] [23]

In all four cases, a(2,η) i (x) =a (2,θ) i (x), and thus η∈ ˜S(θ)

Ify≤0andy+t≤0, then we have thatz (2,θ) i (x)≤0as well asz (2,η) i (x)≤0. In all four cases, a(2,η) i (x) =a (2,θ) i (x), and thus η∈ ˜S(θ). Since η was arbitrary in Tj(θ), it follows thatT j(θ)⊆ ˜S(θ), as claimed. The Discrete Symmetry The group Gj is generated by the singleton sets {k} for k∈ {1, . . . , n 1} \ {j} , i.e., Gj = M{k} : k∈ {1, . . . , n 1...

work page

[24] [24]

For every subsetI⊆[n 1], there exists a unique regionP∈ C d θ,1 such thatS 1(P) =I

work page

[25] [25]

The statements follow from established results on hyperplane arrangements (see, e.g., Orlik and Terao, 1992, Chapter 2)

For every pair of disjoint subsets I, Z⊆[n 1], there exists a non-empty face σ∈ C θ,1 such that σ⊆ T k∈Z Hk andS 1(σ) =I Proof. The statements follow from established results on hyperplane arrangements (see, e.g., Orlik and Terao, 1992, Chapter 2). We offer a proof for completeness. LetHk ={x∈R d |z (1,θ) k (x) = 0} fork∈[n 1]. Sinceθis generic, the first...

work page 1992

[26] [26]

If P∩B j(θ)̸=∅ , then the facet P∩B j(θ) is parallel to a first-layer hyperplaneH i ∈ H 1(θ)if and only ifS 1(P) ={i}

Let P∈ C d θ,1 and let j∈[n 2]. If P∩B j(θ)̸=∅ , then the facet P∩B j(θ) is parallel to a first-layer hyperplaneH i ∈ H 1(θ)if and only ifS 1(P) ={i}

work page

[27] [27]

linear evidence

Let σ1, σ2 ∈ C d−1 θ be second-layer facets. Then σ1 and σ2 are parallel if and only if there exist i∈[n 1]and a regionP∈ C d θ,1 such thatS 1(P) ={i}and bothσ 1 andσ 2 are contained inP. 28 Proof. On a fixed region P∈ C d θ,1, the first-layer activation pattern is constant. Hence each second-layer preactivation z(2,θ) j restricts to an affine-linear func...

work page 2026