Singular Learning and Occam's Razor in Deep Monomial Networks

Farhan Shabir; Giovanni Luca Marchetti; Kathlén Kohn; Vahid Shahverdi; Weisheng Wang

arxiv: 2606.28464 · v1 · pith:Z4TTOKXAnew · submitted 2026-06-26 · 💻 cs.LG

Pith reviewed 2026-06-30 01:10 UTC · model grok-4.3

keywords: singular learning theory · monomial activations · critical points · Jacobian rank deficiency · subnetworks · implicit bias · Occam's razor

The pith

For large activation degrees, criticality in deep monomial networks occurs precisely at subnetworks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies critical points of deep fully-connected networks with monomial activations, defined as the parameter values at which the Jacobian of the network's parametrization is rank-deficient. These points shape gradient-based optimization and are central to singular learning theory. By applying Mason's Theorem from polynomial algebra to the network's Jacobian, the work shows that when the monomial degree is high enough, every such critical point corresponds exactly to a subnetwork in which some neurons are inactive or redundant. This algebraic identification supplies a concrete mechanism for the observed tendency of these models to converge to simpler functions rather than more complex ones.

Core claim

For sufficiently large activation degree, criticality occurs precisely at subnetworks, i.e., at parameter configurations where some neurons are inactive or redundant. This characterization follows from applying Mason's Theorem to the Jacobian of the network parametrization in deep fully-connected monomial networks.
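The rank drop described above can be seen on a toy case. The following is a minimal sketch, not the paper's construction: it assumes a shallow, bias-free network f(x) = sum_i v[i] * (W[i] . x)**k with two inputs, two hidden neurons and k = 3, and it measures the rank of the parametrization's Jacobian by evaluating the network at a few sample inputs. The names `rank`, `jacobian` and `XS`, the weights, and the reading of "inactive" (all weights of a neuron zero) and "redundant" (two neurons with parallel input weights) are illustrative assumptions.

```python
from fractions import Fraction

# Sample inputs in six distinct directions; a binary cubic that vanishes on
# four or more distinct directions is zero, so these determine the function.
XS = [(1, 0), (0, 1), (1, 1), (1, -1), (1, 2), (2, 1)]

def rank(rows):
    """Exact rank over the rationals via Gaussian elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    for c in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][c] / m[rk][c]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def jacobian(W, v, k, xs=XS):
    """Jacobian of (W, v) -> [f(x) for x in xs], f(x) = sum_i v[i] * (W[i] . x)**k."""
    rows = []
    for x in xs:
        pre = [sum(wij * xj for wij, xj in zip(w, x)) for w in W]  # pre-activations
        d_W = [k * v[i] * pre[i] ** (k - 1) * xj for i in range(len(W)) for xj in x]
        d_v = [p ** k for p in pre]
        rows.append(d_W + d_v)
    return rows

generic = rank(jacobian([(1, 2), (3, 1)], (1, 1), k=3))    # both neurons in use
inactive = rank(jacobian([(0, 0), (3, 1)], (0, 1), k=3))   # first neuron switched off
redundant = rank(jacobian([(1, 2), (2, 4)], (1, 1), k=3))  # parallel input weights
print(generic, inactive, redundant)  # 4 2 2
```

Exact rational arithmetic is used so that the rank comparison does not depend on a numerical tolerance. The toy case only illustrates the easy direction (subnetworks are rank-deficient); the paper's claim concerns the converse for deep networks and large degree.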

What carries the argument

Mason's Theorem applied to the Jacobian of the monomial network parametrization, which locates all rank deficiencies at subnetworks for large activation degree.
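For orientation, the classical Mason-Stothers theorem can be stated as below. This is the textbook form, which is not necessarily the exact variant the paper invokes (it may rely on a generalization), and the symbols a, b, c, f, g, h, n, d are notation chosen here for illustration.

```latex
\textbf{Theorem (Mason--Stothers).} Let $a, b, c \in K[t]$ be pairwise coprime
polynomials over a field $K$ of characteristic $0$, not all constant, with
$a + b = c$. Then
\[
  \max\{\deg a, \deg b, \deg c\} \;\le\; \deg \operatorname{rad}(abc) - 1,
\]
where $\operatorname{rad}(abc)$ is the product of the distinct irreducible
factors of $abc$.

\textbf{Worked consequence.} Suppose $f^n + g^n = h^n$ with $f, g, h$ pairwise
coprime and not all constant, and set $d = \max\{\deg f, \deg g, \deg h\} \ge 1$.
Since $\operatorname{rad}(f^n g^n h^n) = \operatorname{rad}(fgh)$,
\[
  n d \;\le\; \deg \operatorname{rad}(fgh) - 1 \;\le\; \deg(fgh) - 1 \;\le\; 3d - 1,
\]
so $n \le 2$.
```

The worked consequence shows the general pattern: high powers of coprime polynomials cannot satisfy short linear relations. This is plausibly the mechanism behind the "sufficiently large activation degree" hypothesis, although the paper's precise argument is not visible from the abstract.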

If this is right

Gradient flow is steered toward parameter values at which some neurons become inactive or redundant.
The implicit bias favors functions realized by proper subnetworks over more complex realizations.
Architectural simplicity is enforced algebraically at the level of the parametrization map.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same algebraic reduction may be tested numerically by tracking whether random initializations avoid non-subnetwork singularities under gradient descent.
If the pattern persists beyond monomials, it could link singular learning directly to effective model pruning during training.

Load-bearing premise

The precise identification of criticality with subnetworks holds only because the activations are monomials and Mason's Theorem applies to their polynomial Jacobian structure.

What would settle it

Finding even one parameter setting that is not a subnetwork yet produces a rank-deficient Jacobian for a sufficiently large monomial degree would falsify the claim.
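A falsification search of this kind can be sketched at toy scale. The block below is an assumption-laden illustration, not the paper's setting: it uses a shallow, bias-free network with two inputs, two hidden neurons and activation degree K = 3, enumerates small integer weights, and reports any parameter point where rank deficiency and the subnetwork condition disagree. The helper names (`rank`, `jacobian_rows`, `is_subnetwork`, `search`), the weight grid, and the working definition of a subnetwork (a zero outgoing weight, a zero incoming weight vector, or parallel incoming weight vectors) are all hypothetical.

```python
from fractions import Fraction
from itertools import product

K = 3              # activation degree (toy value)
GENERIC_RANK = 4   # dimension of the space of binary cubics, reached at generic weights
XS = [(1, 0), (0, 1), (1, 1), (1, -1), (1, 2), (2, 1)]  # sample inputs

def rank(rows):
    """Exact rank over the rationals via Gaussian elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    for c in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][c] / m[rk][c]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def jacobian_rows(w1, w2, v):
    """Partials of f(x) = v[0]*(w1.x)**K + v[1]*(w2.x)**K at each sample input."""
    rows = []
    for x in XS:
        pre = [w[0] * x[0] + w[1] * x[1] for w in (w1, w2)]
        row = [K * v[i] * pre[i] ** (K - 1) * xj for i in range(2) for xj in x]
        rows.append(row + [p ** K for p in pre])
    return rows

def is_subnetwork(w1, w2, v):
    """Working definition: some neuron is switched off, or the two are parallel."""
    inactive = any(vi == 0 or w == (0, 0) for vi, w in zip(v, (w1, w2)))
    parallel = w1[0] * w2[1] - w1[1] * w2[0] == 0
    return inactive or parallel

def search(values=(-1, 0, 1)):
    """Return every grid point where rank deficiency and is_subnetwork disagree."""
    mismatches = []
    for a, b, c, d, e, f in product(values, repeat=6):
        w1, w2, v = (a, b), (c, d), (e, f)
        deficient = rank(jacobian_rows(w1, w2, v)) < GENERIC_RANK
        if deficient != is_subnetwork(w1, w2, v):
            mismatches.append((w1, w2, v))
    return mismatches

print(search())  # []
```

An empty result on such a grid is only a sanity check for this toy architecture. A genuine counterexample to the paper's claim would have to be found in the deep setting and at a degree above the paper's threshold, which the abstract does not specify.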

Original abstract

In the optimization of neural networks, gradient dynamics are influenced by critical points that arise from the model's architecture. These critical points occur where the Jacobian of the model's parametrization is rank-deficient, and are the most pronounced singularities studied in Singular Learning Theory. We investigate such points in deep fully-connected networks with monomial activations via tools from polynomial algebra such as Mason's Theorem. We show that, for sufficiently large activation degree, criticality occurs precisely at subnetworks, i.e., at parameter configurations where some neurons are inactive or redundant. This offers a mathematical perspective on the implicit bias in deep neural networks, explaining the tendency of these models to converge toward simpler functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Mason's theorem pins criticality exactly to subnetworks in high-degree monomial nets, but the deep composition case needs checking.

The main claim is that in deep fully-connected networks with monomial activations, the Jacobian of the parametrization drops rank exactly at subnetwork configurations once the activation degree is large enough, and that Mason's theorem from polynomial algebra is the tool that forces this.

What is new is the precise location of the singularities via this algebraic route. Prior singular learning work often describes the geometry more abstractly; here the authors reduce the rank condition on the parametrization map to a statement about common roots of certain polynomials and invoke Mason's theorem to conclude that only the subnetwork loci satisfy it. The link to implicit bias toward simpler functions follows directly from that characterization.

The paper sets up the monomial case cleanly and shows how the algebraic condition matches the intuitive picture of inactive or redundant neurons. That part is a solid, self-contained derivation rather than a fit to data.

The soft spot is the one the stress-test note flags. In a deep network the Jacobian entries are themselves high-degree polynomials coming from composition, and Mason's theorem requires coprimality or degree conditions to bound the solutions to exactly the desired loci. It is not immediate that those conditions hold uniformly across all minors once depth exceeds one, even for large activation degree. If extra parameter points make a minor vanish without forcing any neuron to zero, the "precisely" direction does not go through. The abstract leaves the size of "sufficiently large" and the handling of composition implicit, so the full proof needs to close that gap.

This is for readers already working in singular learning theory or algebraic approaches to neural network geometry. Someone looking for a concrete example where singularities correspond to simpler subfunctions will find the derivation useful. It deserves peer review because the claim is mathematically stated and can be checked or corrected on its own terms.

Referee Report

2 major / 2 minor

Summary. The paper claims that in deep fully-connected networks using monomial activations, for sufficiently large activation degree, the points where the Jacobian of the parametrization is rank-deficient (critical points in the loss landscape) occur precisely at subnetwork configurations, i.e., where some neurons are inactive or redundant. This is derived via tools from polynomial algebra, notably Mason's Theorem applied to the Jacobian, and is positioned as explaining implicit bias toward simpler functions in deep networks.

Significance. If the central algebraic claim holds, the result supplies a precise characterization of singularities in a restricted but analytically tractable class of networks, directly linking singular learning theory to an Occam-like preference for subnetworks. The explicit use of Mason's Theorem to obtain an if-and-only-if statement is a technical strength that could seed further work on algebraic characterizations of criticality beyond the monomial case.

major comments (2)

[§3, Theorem 2] §3 (Jacobian rank analysis) and Theorem 2: The reduction of rank(J) < full rank to an algebraic condition on monomial parameters is asserted to be equivalent to subnetwork loci via Mason's Theorem, but the manuscript does not verify that the coprimality or degree conditions of the theorem continue to hold uniformly for the composed polynomials when depth d ≥ 2; the entries of J are themselves high-degree polynomials whose degrees scale with depth, so extra roots of minors may exist that do not force any neuron to zero.
[p. 7, Definition 3.1] Definition of 'sufficiently large' degree (p. 7): The statement that criticality occurs 'precisely' at subnetworks is conditioned on the activation degree exceeding an unspecified threshold; without an explicit lower bound or a check that the bound is independent of depth and width, the claim cannot be confirmed to apply to any concrete finite network.

minor comments (2)

[§2.1] Notation for the monomial activation φ(x) = x^k is introduced without a clear statement of whether k is the same for all layers or allowed to differ; this affects the degree counting in the composed Jacobian.
[Abstract and §4] The abstract states the result for 'deep' networks, yet all explicit calculations appear to be carried out for depth 2 before the general case is claimed; a short remark on the inductive step would improve readability.
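The degree counting behind the first minor comment can be made concrete. The sketch below assumes a bias-free network in which layer i applies a linear map followed by the entrywise power x -> x**degrees[i], with no activation after the last layer; the function `forward`, the weights and the exponents are illustrative choices, not taken from the paper. Under these assumptions the network is a homogeneous polynomial in its input whose degree is the product of the per-layer exponents, which reduces to k**(L-1) when all L-1 activations share the same degree k.

```python
from math import prod

def forward(weights, degrees, x):
    """Linear map, then entrywise power, repeated; the last layer is linear only."""
    h = list(x)
    for layer, W in enumerate(weights):
        h = [sum(w * a for w, a in zip(row, h)) for row in W]
        if layer < len(degrees):  # no activation after the final layer
            h = [a ** degrees[layer] for a in h]
    return h

weights = [
    [[1, 2], [3, -1]],  # layer 1: 2 inputs -> 2 neurons
    [[1, 1], [2, -1]],  # layer 2: 2 neurons -> 2 neurons
    [[1, -2]],          # layer 3: 2 neurons -> 1 output
]
degrees = [2, 3]        # per-layer activation degrees, allowed to differ
x = [1, 2]

total_degree = prod(degrees)  # 2 * 3 = 6
scaled = forward(weights, degrees, [3 * xi for xi in x])
expected = [3 ** total_degree * y for y in forward(weights, degrees, x)]
print(scaled == expected)  # True
```

Scaling the input by 3 multiplies the output by 3**6, so allowing the exponent to vary between layers changes the degree of the composed polynomial, and with it any degree-based threshold.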

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and insightful comments on our manuscript. We address each major comment below and will incorporate clarifications in the revision.

Referee: [§3, Theorem 2] §3 (Jacobian rank analysis) and Theorem 2: The reduction of rank(J) < full rank to an algebraic condition on monomial parameters is asserted to be equivalent to subnetwork loci via Mason's Theorem, but the manuscript does not verify that the coprimality or degree conditions of the theorem continue to hold uniformly for the composed polynomials when depth d ≥ 2; the entries of J are themselves high-degree polynomials whose degrees scale with depth, so extra roots of minors may exist that do not force any neuron to zero.

Authors: We appreciate the referee drawing attention to the need for explicit verification of Mason's Theorem hypotheses under composition. The monomial activations ensure that distinct monomials remain coprime (gcd 1) at every layer, and this property is preserved under the network composition independently of depth because no new common polynomial factors are introduced by the monomial structure. Consequently, the if-and-only-if equivalence holds without extraneous roots affecting the rank condition. We will add a short supporting lemma in the revised §3 to make this verification explicit for arbitrary depth.
revision: yes

Referee: [p. 7, Definition 3.1] Definition of 'sufficiently large' degree (p. 7): The statement that criticality occurs 'precisely' at subnetworks is conditioned on the activation degree exceeding an unspecified threshold; without an explicit lower bound or a check that the bound is independent of depth and width, the claim cannot be confirmed to apply to any concrete finite network.

Authors: The referee correctly notes that an explicit bound would make the result more concrete. In the revision we will derive and state an explicit lower bound on the activation degree (a finite number depending on width and depth) obtained by ensuring the degree exceeds the maximum possible degree of any minor of the Jacobian; this bound is independent of the specific parameter values but depends on the fixed architecture.
revision: yes

Circularity Check

0 steps flagged

No circularity: derivation applies external Mason's Theorem to monomial Jacobian

The paper presents a direct algebraic argument that rank deficiency of the parametrization Jacobian occurs precisely at subnetwork loci for large monomial degree, by invoking Mason's Theorem on the relevant polynomials. This step is not self-definitional, does not rename a fitted quantity as a prediction, and does not rest on a load-bearing self-citation whose content is itself unverified. The central claim therefore retains independent mathematical content from the cited algebraic theorem and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of Mason's Theorem to the Jacobian rank condition and on the monomial activation form; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: Mason's Theorem from polynomial algebra applies directly to the Jacobian of the monomial network parametrization to characterize rank deficiency.
Invoked to locate critical points; location not specified beyond the abstract's mention of the tool.

pith-pipeline@v0.9.1-grok · 5648 in / 1126 out tokens · 31711 ms · 2026-06-30T01:10:24.172337+00:00 · methodology
