The Symmetries of Three-Layer ReLU Networks
Pith reviewed 2026-05-20 12:08 UTC · model grok-4.3
The pith
Three-layer bottleneck ReLU networks have their generic parameter fibers completely characterized by explicit semi-algebraic descriptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures, providing explicit semi-algebraic descriptions of these fibers and a polynomial time algorithm for deciding functional equivalence of two parameters. The symmetries include discrete and continuous transformations arising from layer composition, and depend on whether deeper layers hide or preserve geometric structure from preceding layers. Some of these symmetries induce local conservation laws along gradient flow, while others do not.
What carries the argument
Semi-algebraic descriptions of the generic parameter fibers, which capture the discrete and continuous symmetries induced by layer composition in three-layer ReLU bottleneck networks.
If this is right
- Functional equivalence of two parameter sets can be decided in polynomial time.
- Symmetries arise as discrete permutations or continuous transformations depending on whether layers hide or preserve preceding geometric structure.
- Some symmetries produce local conservation laws along gradient flow while others do not.
Where Pith is reading between the lines
- The semi-algebraic approach could be tested on small explicit examples to verify the fiber descriptions before scaling.
- Conservation laws might translate into invariant quantities that constrain training trajectories in these architectures.
- The framework opens a route to study how symmetries change when moving from bottleneck to wider three-layer nets.
Load-bearing premise
The characterization holds for generic parameters in the three-layer bottleneck regime without extra degeneracies from specific data or initialization.
What would settle it
Two parameters that realize the same function but lie outside the described semi-algebraic fiber for a generic three-layer bottleneck ReLU network would falsify the complete characterization.
Figures
read the original abstract
We develop a framework for analyzing parameter symmetries in deep ReLU networks and obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. Our approach provides explicit semi-algebraic descriptions of these fibers and yields a polynomial time algorithm for deciding functional equivalence of two parameters. The symmetries include discrete and continuous transformations arising from layer composition, and depend on whether deeper layers hide or preserve geometric structure from preceding layers. Finally, we show that some of these symmetries induce local conservation laws along gradient flow, while others do not.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a framework for analyzing parameter symmetries in deep ReLU networks and claims a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. It provides explicit semi-algebraic descriptions of these fibers and a polynomial-time algorithm for deciding functional equivalence of two parameters. Symmetries arise from layer composition and geometric preservation properties, with some inducing local conservation laws along gradient flow.
Significance. If the derivations hold, the explicit semi-algebraic fiber descriptions and polynomial-time equivalence test would be a solid contribution to neural network identifiability and loss-landscape analysis. The distinction between symmetries that preserve versus hide geometric structure is a useful organizing principle, and the link to conservation laws along gradient flow is a nice bridge to dynamics. No machine-checked proofs or reproducible code are mentioned, but the mathematical framing is self-contained.
major comments (2)
- [Abstract and §3] Abstract and §3 (Main Results): The polynomial-time algorithm is stated to decide functional equivalence for arbitrary parameter vectors, yet the semi-algebraic fiber description is supplied only for generic points. No separate case analysis or reduction is indicated for degenerate loci (e.g., linearly dependent rows in the bottleneck matrix that produce identical ReLU activation patterns), which could introduce additional discrete or continuous symmetries not captured by the generic description. This directly affects correctness of the decision procedure.
- [§4.2] §4.2 (Algorithm): It is unclear whether the equivalence test first checks membership in the generic stratum or applies a preliminary degeneracy detector; without an explicit handling step whose complexity remains polynomial, the claimed polynomial runtime may not extend to all inputs.
minor comments (2)
- [§3] Notation for the semi-algebraic sets could be clarified with an explicit list of the defining polynomials and inequalities in one place.
- [Figures] Figure 2 (or equivalent) illustrating a non-generic fiber would help readers see the difference from the generic case.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review of our manuscript on symmetries in three-layer ReLU networks. The comments correctly highlight the need to clarify how our results on generic fibers extend to the polynomial-time equivalence test for arbitrary parameters. We address each point below and will revise the manuscript accordingly to strengthen the presentation.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Main Results): The polynomial-time algorithm is stated to decide functional equivalence for arbitrary parameter vectors, yet the semi-algebraic fiber description is supplied only for generic points. No separate case analysis or reduction is indicated for degenerate loci (e.g., linearly dependent rows in the bottleneck matrix that produce identical ReLU activation patterns), which could introduce additional discrete or continuous symmetries not captured by the generic description. This directly affects correctness of the decision procedure.
Authors: We agree that the explicit semi-algebraic fiber descriptions are stated for generic parameters, as emphasized in the title, abstract, and §3. The polynomial-time equivalence test, however, is designed to apply to arbitrary inputs by first detecting degeneracy through rank computations on the weight matrices (e.g., checking linear dependence of rows in the bottleneck layer), which can be performed in polynomial time. Upon detecting a degenerate locus, the test reduces the instance to an equivalent lower-dimensional generic case by merging redundant neurons or adjusts the invariants accordingly, while still correctly deciding functional equivalence. This reduction preserves correctness because functional equivalence is determined by the induced network function, which the invariants capture even in degenerate strata. We will add a dedicated paragraph and case analysis in §3 to make the reduction explicit. revision: yes
-
Referee: [§4.2] §4.2 (Algorithm): It is unclear whether the equivalence test first checks membership in the generic stratum or applies a preliminary degeneracy detector; without an explicit handling step whose complexity remains polynomial, the claimed polynomial runtime may not extend to all inputs.
Authors: The referee is right that §4.2 does not currently spell out the preliminary degeneracy detector. The algorithm begins with a polynomial-time check (matrix rank computations in O(d^3) time for dimension d) to classify the input as generic or degenerate. For degenerate inputs it invokes a separate but still polynomial branch that accounts for extra symmetries by collapsing equivalent activation patterns before applying the core invariants. We will revise the pseudocode, add explicit complexity analysis for the detector, and include a short proof sketch that the overall procedure remains polynomial for all inputs. revision: yes
Circularity Check
No circularity: symmetries derived from layer composition and geometry
full rationale
The paper's central results consist of explicit semi-algebraic descriptions of generic parameter fibers together with a polynomial-time equivalence test. These are obtained by analyzing how ReLU activations and linear maps compose across layers while preserving or hiding geometric structure. No equation or algorithm reduces by construction to a quantity defined in terms of the final fiber description itself, nor does any load-bearing step rely on a self-citation whose content is merely the present claim restated. The derivation remains self-contained against the stated assumptions on generic parameters.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption ReLU networks admit parameter transformations arising from layer composition that preserve the realized function.
- domain assumption Generic parameters in the three-layer bottleneck architecture have fibers that admit explicit semi-algebraic descriptions.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We develop a framework for analyzing parameter symmetries in deep ReLU networks and obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. Our approach provides explicit semi-algebraic descriptions of these fibers...
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The fiber of a parameter θ∈ΘA is S(θ):={η∈ΘA | fη=fθ}. ... permutation of neurons ... positive rescaling ...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel LK Yamins, and Hidenori Tanaka
URLhttps://arxiv.org/abs/2104.05508. Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel LK Yamins, and Hidenori Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In International Conference on Learning Representations, 2021. URL https://openreview.net/ forum?id=q8qLAbQBupm. Bo Zhao, Iordan Ganev, Robin Walt...
-
[2]
URLhttps://proceedings.mlr.press/v291/froese25a.html. 12 Appendix Contents A Background on Polyhedral Geometry 13 B Polyhedral Geometry of ReLU Networks 14 C Layerwise Fibers 18 D Fibers from Layer Composition 25 E Three-Layer Bottleneck Fibers 28 F Implications for Deeper Networks 34 G Discussion: Beyond the Generic Bottleneck Regime 36 A Background on P...
-
[3]
ifP∈ C, then all faces ofPare inC, and
-
[4]
For a polyhedral complex C in Rd and k≤d we denote by Ck the set of k-dimensional polyhedra in C
ifP, P ′ ∈ CandP∩P ′ ̸=∅, thenP∩P ′ is a face of bothPandP ′. For a polyhedral complex C in Rd and k≤d we denote by Ck the set of k-dimensional polyhedra in C. We call Cd theregions, Cd−1 thefacets, and Cd−2 theridgesof C. Thedimensionof a complex C is the maximal dimension of its polyhedra. Given a faceσ∈ C, we denote byaff(σ)⊆R d the unique smallest aff...
work page 2015
-
[5]
Supertransversality: every faceτ∈ C d−k θ is contained in exactlykbent hyperplanes
-
[6]
We denote the set of generic parameters by ˜ΘA ⊆Θ A
Maximal rank of masked products: for every ℓ∈[L+ 1] and every sequence of index sets Si ⊆[n i] for i≤ℓ−1 , the matrix product W (ℓ)DSℓ−1 W (ℓ−1) · · ·D S1 W (1) attains the maximum possible rank, namelymin d, n ℓ,min 1≤i≤ℓ−1 |Si| . We denote the set of generic parameters by ˜ΘA ⊆Θ A. Lemma 20.The set ˜ΘA is open and dense inΘ A. B.3 Breakpoint Complexes a...
work page 2026
-
[7]
If # starBθ(τ) d−1 = 4, then there are exactly two facetsσ1, σ2 ∈star Bθ(τ) d−1 such that eσ1/τ = −eσ2/τ ,and these are precisely the facets contained inH. In particular,H∩R= aff(σ 1)∩R
-
[8]
If # starBθ(τ) d−1 = 3 , then there is a unique facet σ∈star Bθ(τ) d−1 that is not adjacent to a region on which fθ is constant, and this facet is the one contained in H. In particular, H∩R= aff(σ)∩R. B.6 Visibility of First-Layer Hyperplanes in Three-Layer Networks We now prove the visibility result stated as Theorem 2 in the main text. The key point is ...
-
[9]
If H∈ H 1(θ) and there exists a neuron j∈[n 2] together with a point x∈H such that z(2,θ) j (x)>0, thenH∈ H 1(η)
-
[10]
In particular, if this holds for everyH∈ H 1(θ), thenH 1(θ) =H 1(η)andC θ =C η. Proof of Theorem 2. Let H∈ H 1(θ). By assumption, there exists x∈H such that z(2,θ) j (x)>0 . We consider two cases based on the behavior ofz (2,θ) j onH: Case 1:There also exists x′ ∈H such that z(2,θ) j (x′)<0 . By continuity, there must be a point on H where z(2,θ) j = 0, m...
-
[11]
Breakpoint Structure: Nonlinear neurons align with hyperplanes according to their assignment and orientation. • For visible neurons, the parameters are fixed as: (W (1) i , b(1) i ) =o(i)(a ϕ(i), tϕ(i))∀i∈N vis.(4) • For canceling neurons, any two in the same group, j, i∈ϕ −1(ℓ), share the same hyperplane and are oriented according tooas: (W (1) i , b(1) ...
-
[12]
Weight Consistency: The output weights for each neuron group j add to the corresponding weight: X i∈ϕ−1(j) o(i)W (2) :,i = c(j)ifj∈ {1, . . . , k}, 0ifj∈ {k+ 1, . . . , n}. (7)
-
[13]
Affine Base Map: The neurons active on the reference regionR must match (AR, bR). A nonlinear neuroni∈N vis ∪N can is active onRif and only ifo(i) =−1: X i∈Nlin W (2) :,i W (1) i + X i∈Nvis∪Ncan o(i)=−1 W (2) :,i W (1) i =A R,(8) X i∈Nlin W (2) :,i b(1) i + X i∈Nvis∪Ncan o(i)=−1 W (2) :,i b(1) i +b (2) =b R.(9) We denote byS(θ) ∼the fiber modulo positive ...
-
[14]
Thus k=|N vis|and the assignment mapϕrestricted toN vis is a permutation
Every nonlinear neuron i∈N vis corresponds to a unique hyperplane Hj ∈ H . Thus k=|N vis|and the assignment mapϕrestricted toN vis is a permutation
-
[15]
There are no canceling neurons. ThusN can =∅. Next we prove our result describing the generic fiber of a one-hidden-layer network. Proposition 5.Let θ be a generic and minimal parameter. Then the generic fiber is given by the disjoint union ˜S(θ) ∼= S S⊆[k] KS ∩ ˜ΘA ∼, where the union is taken over orientations S⊆[k], andK S ={p S} ×V S is the semi-algebr...
-
[16]
Ifn−k <min{d, m}, thendim(V S) = (n−k) 2 if and only ifS=S θ, andV S =∅otherwise
-
[17]
Ifn−k=d≤m, thendim(V S) = (n−k) 2 for allS⊆[k]
-
[18]
If n−k=m < d , then dim(VS) = (n−k) 2 if and only if cone(AS)⊆col(A S) is pointed, and VS =∅otherwise, whereA S is the compensation defined in(1)
-
[19]
Proof of Theorem 6.LetJ= [n]\[k]
Ifn−k=m+ 1≤d, thendim(V S) =m 2 +m+dfor allS⊆[k]. Proof of Theorem 6.LetJ= [n]\[k]. For alli∈[k]we have that c(i)a⊤ i =o θ(i)W (2,θ) :,i W (1,θ) i .(10) Recall that on the reference regionRwe have that fη(x) =A Rx+b R for all η∈ S(θ). Further, we have that AR, bR split into contributions from the linear and non-linear neurons. For θ, let oθ(i) be the hype...
-
[20]
Ify >0andy+t >0, thenz (2,η) i (x)−z (2,θ) i (x) = 0
-
[21]
Moreover, it holds that z(2,η) i (x)≤W (2) ij [y+t] + +b (2,η) i =b (2,η) i =b (2) i −tW (2) ij ≤0
Ify >0andy+t≤0, thent <0andz (2,η) i (x)≥z (2,θ) i (x). Moreover, it holds that z(2,η) i (x)≤W (2) ij [y+t] + +b (2,η) i =b (2,η) i =b (2) i −tW (2) ij ≤0. For the first inequality we used W (2) ik ≤0 for all k̸=j ; for the second equality we used y+t≤0; and for the last inequality we usedb (2) i ≤0,W (2) ij >0, andt≤0. 25
-
[22]
Moreover, it holds that z(2,θ) i (x)≤W (2) ij [y]+ +b (2) i =b (2) i ≤0
Ify≤0andy+t >0, thent >0andz (2,η) i (x)≤z (2,θ) i (x). Moreover, it holds that z(2,θ) i (x)≤W (2) ij [y]+ +b (2) i =b (2) i ≤0
-
[23]
In all four cases, a(2,η) i (x) =a (2,θ) i (x), and thus η∈ ˜S(θ)
Ify≤0andy+t≤0, then we have thatz (2,θ) i (x)≤0as well asz (2,η) i (x)≤0. In all four cases, a(2,η) i (x) =a (2,θ) i (x), and thus η∈ ˜S(θ). Since η was arbitrary in Tj(θ), it follows thatT j(θ)⊆ ˜S(θ), as claimed. The Discrete Symmetry The group Gj is generated by the singleton sets {k} for k∈ {1, . . . , n 1} \ {j} , i.e., Gj = M{k} : k∈ {1, . . . , n 1...
-
[24]
For every subsetI⊆[n 1], there exists a unique regionP∈ C d θ,1 such thatS 1(P) =I
-
[25]
For every pair of disjoint subsets I, Z⊆[n 1], there exists a non-empty face σ∈ C θ,1 such that σ⊆ T k∈Z Hk andS 1(σ) =I Proof. The statements follow from established results on hyperplane arrangements (see, e.g., Orlik and Terao, 1992, Chapter 2). We offer a proof for completeness. LetHk ={x∈R d |z (1,θ) k (x) = 0} fork∈[n 1]. Sinceθis generic, the first...
work page 1992
-
[26]
Let P∈ C d θ,1 and let j∈[n 2]. If P∩B j(θ)̸=∅ , then the facet P∩B j(θ) is parallel to a first-layer hyperplaneH i ∈ H 1(θ)if and only ifS 1(P) ={i}
-
[27]
Let σ1, σ2 ∈ C d−1 θ be second-layer facets. Then σ1 and σ2 are parallel if and only if there exist i∈[n 1]and a regionP∈ C d θ,1 such thatS 1(P) ={i}and bothσ 1 andσ 2 are contained inP. 28 Proof. On a fixed region P∈ C d θ,1, the first-layer activation pattern is constant. Hence each second-layer preactivation z(2,θ) j restricts to an affine-linear func...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.