Singular Learning and Occam's Razor in Deep Monomial Networks
Pith reviewed 2026-06-30 01:10 UTC · model grok-4.3
The pith
For large activation degrees, criticality in deep monomial networks occurs precisely at subnetworks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For sufficiently large activation degree, criticality occurs precisely at subnetworks, i.e., at parameter configurations where some neurons are inactive or redundant. This characterization follows from applying Mason's Theorem to the Jacobian of the network parametrization in deep fully-connected monomial networks.
What carries the argument
Mason's Theorem applied to the Jacobian of the monomial network parametrization, which locates all rank deficiencies at subnetworks for large activation degree.
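For reference, the classical form of the theorem is stated below. This is the standard Mason-Stothers statement for three polynomials; the summary above does not say which version or generalization the paper actually uses, so treat the match to the paper's argument as an assumption.

```latex
\textbf{Theorem (Mason-Stothers).} Let $K$ be a field of characteristic $0$ and let
$a, b, c \in K[t]$ be pairwise coprime polynomials, not all constant, with $a + b = c$. Then
\[
  \max\{\deg a,\ \deg b,\ \deg c\} \;\le\; n_0(abc) - 1,
\]
where $n_0(f)$ denotes the number of distinct roots of $f$ in an algebraic closure of $K$.
```

A power $p^k$ has degree $k \deg p$ but at most $\deg p$ distinct roots, so sums of high powers are tightly constrained by this inequality. That is presumably why a large activation degree matters here, though this connection is an interpretation and not a statement taken from the paper.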
If this is right
- Gradient flow is steered toward parameter values that deactivate neurons or make them redundant.
- The implicit bias favors functions realized by proper subnetworks over more complex realizations.
- Architectural simplicity is enforced algebraically at the level of the parametrization map.
Where Pith is reading between the lines
- The same algebraic reduction may be tested numerically by tracking whether random initializations avoid non-subnetwork singularities under gradient descent.
- If the pattern persists beyond monomials, it could link singular learning directly to effective model pruning during training.
Load-bearing premise
The precise identification of criticality with subnetworks holds only because the activations are monomials and Mason's Theorem applies to their polynomial Jacobian structure.
What would settle it
Finding even one parameter setting that is not a subnetwork yet produces a rank-deficient Jacobian for a sufficiently large monomial degree would falsify the claim.
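The falsification test described above can be sketched on a toy case. The block below is a minimal illustration, not the paper's setting: it assumes a shallow network f(x, y) = sum_i a_i (u_i x + v_i y)^k with two hidden neurons and k = 5, and it assumes that "rank-deficient" means rank below the generic rank (here 4, because each neuron has a rescaling symmetry). The names `jacobian`, `rank`, `is_subnetwork`, `counterexamples`, `K`, and `GENERIC_RANK` are hypothetical. Exact rational arithmetic avoids numerical rank tolerances.

```python
from fractions import Fraction
from itertools import product
from math import comb

K = 5             # activation degree (toy choice, an assumption)
GENERIC_RANK = 4  # 2 * h for h = 2 neurons: one rescaling symmetry per neuron


def jacobian(a, u, v, k=K):
    """Jacobian of (a, u, v) -> coefficients of f(x, y) = sum_i a_i (u_i x + v_i y)^k.

    Row j holds the partial derivatives of the coefficient of x^(k-j) y^j;
    columns are ordered a_1..a_h, u_1..u_h, v_1..v_h.
    """
    h = len(a)
    rows = []
    for j in range(k + 1):
        c = comb(k, j)
        d_a = [c * u[i] ** (k - j) * v[i] ** j for i in range(h)]
        d_u = [c * a[i] * (k - j) * u[i] ** (k - j - 1) * v[i] ** j if j < k else 0
               for i in range(h)]
        d_v = [c * a[i] * j * u[i] ** (k - j) * v[i] ** (j - 1) if j > 0 else 0
               for i in range(h)]
        rows.append(d_a + d_u + d_v)
    return rows


def rank(rows):
    """Exact matrix rank by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            if factor != 0:
                m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def is_subnetwork(a, u, v):
    """Toy notion of 'inactive or redundant' for two hidden neurons (an assumption)."""
    inactive = any(a[i] == 0 or (u[i] == 0 and v[i] == 0) for i in range(2))
    redundant = u[0] * v[1] - u[1] * v[0] == 0  # parallel weight vectors
    return inactive or redundant


def counterexamples(values=(-1, 0, 1)):
    """Grid points where 'rank-deficient' and 'is a subnetwork' disagree."""
    bad = []
    for a1, a2, u1, u2, v1, v2 in product(values, repeat=6):
        a, u, v = (a1, a2), (u1, u2), (v1, v2)
        deficient = rank(jacobian(a, u, v)) < GENERIC_RANK
        if deficient != is_subnetwork(a, u, v):
            bad.append((a, u, v))
    return bad


print(rank(jacobian((1, 2), (1, 1), (1, 2))))  # generic point: 4
print(rank(jacobian((0, 2), (1, 1), (1, 2))))  # inactive neuron (a_1 = 0): 3
print(rank(jacobian((1, 2), (1, 2), (1, 2))))  # redundant neurons (parallel weights): 2
print(counterexamples())                       # [] on this small integer grid
```

An empty list here only says that the toy case behaves as the claim predicts on a small integer grid. A real test of the paper's statement would need deeper networks, the paper's own definition of a subnetwork, and its degree threshold, none of which are specified in this summary.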
Original abstract
In the optimization of neural networks, gradient dynamics are influenced by critical points that arise from the model's architecture. These critical points occur where the Jacobian of the model's parametrization is rank-deficient, and are the most pronounced singularities studied in Singular Learning Theory. We investigate such points in deep fully-connected networks with monomial activations via tools from polynomial algebra such as Mason's Theorem. We show that, for sufficiently large activation degree, criticality occurs precisely at subnetworks, i.e., at parameter configurations where some neurons are inactive or redundant. This offers a mathematical perspective on the implicit bias in deep neural networks, explaining the tendency of these models to converge toward simpler functions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in deep fully-connected networks using monomial activations, for sufficiently large activation degree, the points where the Jacobian of the parametrization is rank-deficient (critical points in the loss landscape) occur precisely at subnetwork configurations, i.e., where some neurons are inactive or redundant. This is derived via tools from polynomial algebra, notably Mason's Theorem applied to the Jacobian, and is positioned as explaining implicit bias toward simpler functions in deep networks.
Significance. If the central algebraic claim holds, the result supplies a precise characterization of singularities in a restricted but analytically tractable class of networks, directly linking singular learning theory to an Occam-like preference for subnetworks. The explicit use of Mason's Theorem to obtain an if-and-only-if statement is a technical strength that could seed further work on algebraic characterizations of criticality beyond the monomial case.
major comments (2)
- [§3, Theorem 2] §3 (Jacobian rank analysis) and Theorem 2: The manuscript asserts, via Mason's Theorem, that rank deficiency of J reduces to an algebraic condition on the parameters that is equivalent to lying on a subnetwork locus. It does not verify that the coprimality and degree conditions of the theorem continue to hold uniformly for the composed polynomials when depth d ≥ 2. The entries of J are themselves high-degree polynomials whose degrees scale with depth, so minors may have extra roots that do not force any neuron to zero.
- [p. 7, Definition 3.1] Definition of 'sufficiently large' degree (p. 7): The statement that criticality occurs 'precisely' at subnetworks is conditioned on the activation degree exceeding an unspecified threshold; without an explicit lower bound or a check that the bound is independent of depth and width, the claim cannot be confirmed to apply to any concrete finite network.
minor comments (2)
- [§2.1] Notation for the monomial activation φ(x) = x^k is introduced without a clear statement of whether k is the same for all layers or allowed to differ; this affects the degree counting in the composed Jacobian.
- [Abstract and §4] The abstract states the result for 'deep' networks, yet all explicit calculations appear to be carried out for depth 2 before the general case is claimed; a short remark on the inductive step would improve readability.
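The degree counting behind these comments can be made concrete under an assumed notation. The symbols $L$, $W_\ell$, $\varphi_\ell$, and $k_\ell$ below are illustrative choices, and the paper's own depth convention may differ.

```latex
Assume $L$ weight matrices $W_1, \dots, W_L$ and coordinatewise activations
$\varphi_\ell(z) = z^{k_\ell}$ after layer $\ell = 1, \dots, L-1$:
\[
  f_\theta(x) = W_L\, \varphi_{L-1}\bigl(W_{L-1} \cdots \varphi_1(W_1 x)\bigr).
\]
Each output coordinate is homogeneous in $x$ of degree $\prod_{m=1}^{L-1} k_m$,
and homogeneous in the entries of $W_\ell$ of degree $\prod_{m=\ell}^{L-1} k_m$
(an empty product, so degree $1$, for $\ell = L$).
With a common degree $k_\ell = k$ these become $k^{L-1}$ and $k^{L-\ell}$.
```

A partial derivative with respect to an entry of $W_\ell$ therefore has degree one less in $W_\ell$, so for $k \ge 2$ the degrees of the Jacobian entries grow exponentially with depth. Whether $k$ is shared across layers changes these counts, which is the point of the first minor comment.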
Simulated Author's Rebuttal
We thank the referee for the thorough review and insightful comments on our manuscript. We address each major comment below and will incorporate clarifications in the revision.
Point-by-point responses
- Referee: [§3, Theorem 2] §3 (Jacobian rank analysis) and Theorem 2: The manuscript asserts, via Mason's Theorem, that rank deficiency of J reduces to an algebraic condition on the parameters that is equivalent to lying on a subnetwork locus. It does not verify that the coprimality and degree conditions of the theorem continue to hold uniformly for the composed polynomials when depth d ≥ 2. The entries of J are themselves high-degree polynomials whose degrees scale with depth, so minors may have extra roots that do not force any neuron to zero.
  Authors: We appreciate the referee drawing attention to the need to verify the hypotheses of Mason's Theorem explicitly under composition. The monomial activations ensure that distinct monomials remain coprime (gcd 1) at every layer, and this property is preserved under the network composition independently of depth, because the monomial structure introduces no new common polynomial factors. Consequently, the if-and-only-if equivalence holds without extraneous roots affecting the rank condition. We will add a short supporting lemma in the revised §3 to make this verification explicit for arbitrary depth.
  Revision: yes
- Referee: [p. 7, Definition 3.1] Definition of 'sufficiently large' degree (p. 7): The statement that criticality occurs 'precisely' at subnetworks is conditioned on the activation degree exceeding an unspecified threshold; without an explicit lower bound or a check that the bound is independent of depth and width, the claim cannot be confirmed to apply to any concrete finite network.
  Authors: The referee correctly notes that an explicit bound would make the result more concrete. In the revision we will derive and state an explicit lower bound on the activation degree (a finite number depending on width and depth) obtained by ensuring the degree exceeds the maximum possible degree of any minor of the Jacobian; this bound is independent of the specific parameter values but depends on the fixed architecture.
  Revision: yes
Circularity Check
No circularity: derivation applies external Mason's Theorem to monomial Jacobian
Full rationale
The paper presents a direct algebraic argument that rank deficiency of the parametrization Jacobian occurs precisely at subnetwork loci for large monomial degree, by invoking Mason's Theorem on the relevant polynomials. This step is not self-definitional, does not rename a fitted quantity as a prediction, and does not rest on a load-bearing self-citation whose content is itself unverified. The central claim therefore retains independent mathematical content from the cited algebraic theorem and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Mason's Theorem from polynomial algebra applies directly to the Jacobian of the monomial network parametrization to characterize rank deficiency.