arxiv: 2606.28464 · v1 · pith:Z4TTOKXAnew · submitted 2026-06-26 · 💻 cs.LG

Singular Learning and Occam's Razor in Deep Monomial Networks

Pith reviewed 2026-06-30 01:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords singular learning theory, monomial activations, critical points, Jacobian rank deficiency, subnetworks, implicit bias, Occam's razor

The pith

For large activation degrees, criticality in deep monomial networks occurs precisely at subnetworks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies critical points in the loss landscape of deep fully-connected networks that use monomial activations, defined as locations where the Jacobian of the parametrization has deficient rank. These points shape gradient-based optimization and are central to singular learning theory. By applying Mason's Theorem from polynomial algebra to the network's Jacobian, the work shows that when the monomial degree is high enough, every such critical point corresponds exactly to a subnetwork in which some neurons are inactive or redundant. This algebraic identification supplies a concrete mechanism for the observed tendency of these models to converge to simpler functions rather than more complex ones.

Core claim

For sufficiently large activation degree, criticality occurs precisely at subnetworks, i.e., at parameter configurations where some neurons are inactive or redundant. This characterization follows from applying Mason's Theorem to the Jacobian of the network parametrization in deep fully-connected monomial networks.
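
To make "rank-deficient Jacobian" concrete, here is a minimal NumPy sketch on a toy shallow network f(x) = Σ_i a_i (w_i · x)^k with two inputs, two hidden neurons and k = 3. The architecture, the parameter values, the sample inputs and the helper names `jacobian_shallow` and `jacobian_rank` are all illustrative assumptions, not taken from the paper. The function is identified with its values at five inputs in pairwise distinct directions, which determines a binary cubic uniquely, so the rank of this Jacobian equals the rank of the parametrization map.

```python
import numpy as np

# Five inputs in pairwise distinct directions: enough to pin down a binary cubic.
X = np.array([[1, 0], [0, 1], [1, 1], [1, -1], [1, 2]], dtype=float)

def jacobian_shallow(W, a, X, k):
    """Jacobian of (a, W) -> (f(x_1), ..., f(x_N)) for f(x) = sum_i a_i (w_i . x)^k."""
    pre = X @ W.T                                   # (N, width): pre-activations w_i . x
    d_a = pre ** k                                  # df/da_i = (w_i . x)^k
    # df/dW_ij = k * a_i * (w_i . x)^(k-1) * x_j
    d_W = (k * a * pre ** (k - 1))[:, :, None] * X[:, None, :]
    return np.hstack([d_a, d_W.reshape(len(X), -1)])

def jacobian_rank(W, a, k=3):
    """Numerical rank of the toy Jacobian (absolute tolerance is an assumption)."""
    J = jacobian_shallow(np.asarray(W, float), np.asarray(a, float), X, k)
    return int(np.linalg.matrix_rank(J, tol=1e-8))

generic = jacobian_rank([[1, 2], [3, -1]], [1, 1])    # two independent, active neurons
inactive = jacobian_rank([[0, 0], [3, -1]], [0, 1])   # neuron 1 switched off
redundant = jacobian_rank([[1, 2], [2, 4]], [1, 1])   # neuron 2 parallel to neuron 1

print(generic, inactive, redundant)  # prints: 4 2 2
```

In this toy case the generic point reaches the full four-dimensional space of binary cubics, while both subnetwork configurations drop to rank 2. The sketch only illustrates the "subnetwork implies critical" direction; the paper's claim is the converse as well, for deep networks and sufficiently large degree.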

What carries the argument

Mason's Theorem applied to the Jacobian of the monomial network parametrization, which locates all rank deficiencies at subnetworks for large activation degree.
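
For reference, the classical form of the theorem can be stated as follows. This summary does not say which variant the paper applies (the reference list suggests a generalization may be involved), so the statement below is the textbook Mason–Stothers version, and the worked consequence is offered only as intuition for why large exponents are rigid.

```latex
% Classical Mason--Stothers theorem, over a field of characteristic 0.
\textbf{Theorem (Mason--Stothers).}
Let $a(t), b(t), c(t)$ be pairwise coprime polynomials, not all constant,
with $a + b = c$. Then
\[
  \max\{\deg a, \deg b, \deg c\} \;\le\; \deg \operatorname{rad}(abc) - 1,
\]
where $\operatorname{rad}(abc)$ is the product of the distinct irreducible
factors of $abc$.

% Standard consequence: high powers cannot satisfy a linear relation.
\textbf{Example.}
Suppose $x^n + y^n = z^n$ with $x, y, z$ pairwise coprime and not all constant.
Since $\operatorname{rad}(x^n y^n z^n) = \operatorname{rad}(xyz)$, the theorem gives
\[
  n \deg x \;\le\; \deg x + \deg y + \deg z - 1,
\]
and likewise for $y$ and $z$. Summing the three inequalities with
$S = \deg x + \deg y + \deg z$ yields $nS \le 3S - 3$, hence $n < 3$.
```

The example shows the general flavor: once the exponent is large relative to the number of terms, a linear dependence among powers forces the underlying polynomials to share factors or degenerate.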

If this is right

  • Gradient flow is steered toward parameter values at which some neurons are deactivated or rendered redundant.
  • The implicit bias favors functions realized by proper subnetworks over more complex realizations.
  • Architectural simplicity is enforced algebraically at the level of the parametrization map.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same algebraic reduction may be tested numerically by tracking whether random initializations avoid non-subnetwork singularities under gradient descent.
  • If the pattern persists beyond monomials, it could link singular learning directly to effective model pruning during training.

Load-bearing premise

The precise identification of criticality with subnetworks holds only because the activations are monomials and Mason's Theorem applies to their polynomial Jacobian structure.

What would settle it

Finding even one parameter setting that is not a subnetwork yet produces a rank-deficient Jacobian for a sufficiently large monomial degree would falsify the claim.
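
A numerical probe along these lines could look like the sketch below. It builds a bias-free network with widths 2-2-2-1 and activation t ↦ t^k, computes the Jacobian by complex-step differentiation (a standard numerical device chosen here for convenience, not the paper's method), and compares ranks at hand-picked and random parameters. All names, the architecture, the tolerance and the choice k = 2 are assumptions made to keep the example small; k = 2 may well lie below the paper's unspecified degree threshold.

```python
import numpy as np

SHAPES = [(2, 2), (2, 2), (1, 2)]          # (out, in) per layer; hypothetical toy architecture
# Six inputs in distinct directions: enough to pin down a binary quartic (k = 2, two hidden layers).
X = np.array([[1, 0], [0, 1], [1, 1], [1, -1], [1, 2], [2, 1]], dtype=float)

def pack(*mats):
    """Flatten weight matrices into one parameter vector."""
    return np.concatenate([np.asarray(m, float).ravel() for m in mats])

def forward(theta, k):
    """Network outputs at the rows of X; works for real or complex theta."""
    h, pos = X.T.astype(theta.dtype), 0
    for idx, (m, n) in enumerate(SHAPES):
        W = theta[pos:pos + m * n].reshape(m, n)
        pos += m * n
        h = W @ h
        if idx < len(SHAPES) - 1:          # monomial activation on hidden layers only
            out = h
            for _ in range(k - 1):
                out = out * h
            h = out
    return h.ravel()

def jacobian(theta, k, step=1e-20):
    """Complex-step derivative: column j is Im f(theta + i*step*e_j) / step."""
    cols = []
    for j in range(theta.size):
        z = theta.astype(complex)
        z[j] += 1j * step
        cols.append(forward(z, k).imag / step)
    return np.stack(cols, axis=1)

def numerical_rank(J, rel_tol=1e-10):
    s = np.linalg.svd(J, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

k = 2
generic_theta = pack([[1, 0], [0, 1]], [[1, 1], [1, -1]], [[1, 1]])
dead_theta = pack([[0, 0], [0, 1]], [[0, 1], [0, -1]], [[1, 1]])   # first hidden neuron removed

generic_rank = numerical_rank(jacobian(generic_theta, k))
dead_rank = numerical_rank(jacobian(dead_theta, k))

# Random probing: look for rank drops away from the hand-built subnetwork point.
rng = np.random.default_rng(0)
random_ranks = [numerical_rank(jacobian(rng.standard_normal(10), k)) for _ in range(20)]
print(generic_rank, dead_rank, min(random_ranks))
```

A hand computation for this toy case gives rank 5 at the generic point (the Jacobian spans all five quartic monomials) and rank 2 at the dead-neuron point (only x y^3 and y^4 are reached). Random sampling lands on generic parameters with probability one, so a serious falsification attempt would have to solve for vanishing Jacobian minors directly rather than sample.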

Original abstract

In the optimization of neural networks, gradient dynamics are influenced by critical points that arise from the model's architecture. These critical points occur where the Jacobian of the model's parametrization is rank-deficient, and are the most pronounced singularities studied in Singular Learning Theory. We investigate such points in deep fully-connected networks with monomial activations via tools from polynomial algebra such as Mason's Theorem. We show that, for sufficiently large activation degree, criticality occurs precisely at subnetworks, i.e., at parameter configurations where some neurons are inactive or redundant. This offers a mathematical perspective on the implicit bias in deep neural networks, explaining the tendency of these models to converge toward simpler functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in deep fully-connected networks using monomial activations, for sufficiently large activation degree, the points where the Jacobian of the parametrization is rank-deficient (critical points in the loss landscape) occur precisely at subnetwork configurations, i.e., where some neurons are inactive or redundant. This is derived via tools from polynomial algebra, notably Mason's Theorem applied to the Jacobian, and is positioned as explaining implicit bias toward simpler functions in deep networks.

Significance. If the central algebraic claim holds, the result supplies a precise characterization of singularities in a restricted but analytically tractable class of networks, directly linking singular learning theory to an Occam-like preference for subnetworks. The explicit use of Mason's Theorem to obtain an if-and-only-if statement is a technical strength that could seed further work on algebraic characterizations of criticality beyond the monomial case.

major comments (2)
  1. [§3, Theorem 2] §3 (Jacobian rank analysis) and Theorem 2: The reduction of rank(J) < full rank to an algebraic condition on monomial parameters is asserted to be equivalent to subnetwork loci via Mason's Theorem, but the manuscript does not verify that the coprimality or degree conditions of the theorem continue to hold uniformly for the composed polynomials when depth d ≥ 2; the entries of J are themselves high-degree polynomials whose degrees scale with depth, so extra roots of minors may exist that do not force any neuron to zero.
  2. [p. 7, Definition 3.1] Definition of 'sufficiently large' degree (p. 7): The statement that criticality occurs 'precisely' at subnetworks is conditioned on the activation degree exceeding an unspecified threshold; without an explicit lower bound or a check that the bound is independent of depth and width, the claim cannot be confirmed to apply to any concrete finite network.
minor comments (2)
  1. [§2.1] Notation for the monomial activation φ(x) = x^k is introduced without a clear statement of whether k is the same for all layers or allowed to differ; this affects the degree counting in the composed Jacobian.
  2. [Abstract and §4] The abstract states the result for 'deep' networks, yet all explicit calculations appear to be carried out for depth 2 before the general case is claimed; a short remark on the inductive step would improve readability.
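
The degree counting behind major comment 1 and minor comment 1 can be written out as follows. This is a sketch under the assumption of a bias-free network with L weight matrices and possibly layer-dependent exponents k_ℓ; the symbols L, k_ℓ and W_ℓ are notation introduced here, not quoted from the paper.

```latex
% Assumed form: no biases, activation \phi_\ell(t) = t^{k_\ell} after layer \ell < L.
\[
  f_\theta(x) \;=\; W_L\,\phi_{L-1}\!\bigl(W_{L-1}\,\cdots\,\phi_1(W_1 x)\bigr),
  \qquad \phi_\ell(t) = t^{k_\ell}.
\]
% Degree in the input x:
\[
  \deg_x f_\theta \;=\; \prod_{\ell=1}^{L-1} k_\ell
  \qquad\bigl(= k^{\,L-1} \text{ if } k_\ell = k \text{ for all } \ell\bigr).
\]
% Degree in the entries of a single weight matrix W_\ell:
\[
  \deg_{W_\ell} f_\theta \;=\; \prod_{j=\ell}^{L-1} k_j \quad (\ell < L),
  \qquad \deg_{W_L} f_\theta = 1 .
\]
```

Differentiating with respect to an entry of W_ℓ lowers the degree in W_ℓ by one and leaves the other degrees unchanged, so the entries of the Jacobian have parameter degree that grows multiplicatively with depth. Whether k is shared across layers therefore changes the degree bookkeeping directly.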

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and insightful comments on our manuscript. We address each major comment below and will incorporate clarifications in the revision.

Point-by-point responses
  1. Referee: [§3, Theorem 2] §3 (Jacobian rank analysis) and Theorem 2: The reduction of rank(J) < full rank to an algebraic condition on monomial parameters is asserted to be equivalent to subnetwork loci via Mason's Theorem, but the manuscript does not verify that the coprimality or degree conditions of the theorem continue to hold uniformly for the composed polynomials when depth d ≥ 2; the entries of J are themselves high-degree polynomials whose degrees scale with depth, so extra roots of minors may exist that do not force any neuron to zero.

    Authors: We appreciate the referee drawing attention to the need for explicit verification of Mason's Theorem hypotheses under composition. The monomial activations ensure that distinct monomials remain coprime (gcd 1) at every layer, and this property is preserved under the network composition independently of depth because no new common polynomial factors are introduced by the monomial structure. Consequently, the if-and-only-if equivalence holds without extraneous roots affecting the rank condition. We will add a short supporting lemma in the revised §3 to make this verification explicit for arbitrary depth.

    revision: yes

  2. Referee: [p. 7, Definition 3.1] Definition of 'sufficiently large' degree (p. 7): The statement that criticality occurs 'precisely' at subnetworks is conditioned on the activation degree exceeding an unspecified threshold; without an explicit lower bound or a check that the bound is independent of depth and width, the claim cannot be confirmed to apply to any concrete finite network.

    Authors: The referee correctly notes that an explicit bound would make the result more concrete. In the revision we will derive and state an explicit lower bound on the activation degree (a finite number depending on width and depth) obtained by ensuring the degree exceeds the maximum possible degree of any minor of the Jacobian; this bound is independent of the specific parameter values but depends on the fixed architecture.

    revision: yes

Circularity Check

0 steps flagged

No circularity: derivation applies external Mason's Theorem to monomial Jacobian

full rationale

The paper presents a direct algebraic argument that rank deficiency of the parametrization Jacobian occurs precisely at subnetwork loci for large monomial degree, by invoking Mason's Theorem on the relevant polynomials. This step is not self-definitional, does not rename a fitted quantity as a prediction, and does not rest on a load-bearing self-citation whose content is itself unverified. The central claim therefore retains independent mathematical content from the cited algebraic theorem and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the applicability of Mason's Theorem to the Jacobian rank condition and on the monomial activation form; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Mason's Theorem from polynomial algebra applies directly to the Jacobian of the monomial network parametrization to characterize rank deficiency.
    Invoked to locate critical points; location not specified beyond the abstract's mention of the tool.

pith-pipeline@v0.9.1-grok · 5648 in / 1126 out tokens · 31711 ms · 2026-06-30T01:10:24.172337+00:00 · methodology
