Recognition: 2 theorem links
Does Sparse Connectivity Improve Generalization? Convolutional Networks Below the Edge of Stability
Pith reviewed 2026-05-15 16:05 UTC · model grok-4.3
The pith
Sparse connectivity produces non-vacuous generalization bounds below the edge of stability for two-layer ReLU networks on spherical inputs where fully-connected networks fail.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For two-layer ReLU networks with sparse connectivity, the edge-of-stability condition on the largest Hessian eigenvalue imposes a constraint whose effectiveness is determined by the low-dimensional geometry of the multiset of patches extracted from the training data. When receptive fields remain small relative to the ambient dimension and the patch collection carries sufficient geometric structure, this constraint produces non-vacuous generalization bounds precisely in the high-dimensional spherical regime where fully-connected networks provably fail. The analysis also identifies the complementary failure mode, in which unstructured patches render the constraint ineffective and permit overfitting.
What carries the argument
The stability-induced constraint on the geometry of the training patch collection under sparse receptive fields.
Load-bearing premise
The collection of training patches possesses low-dimensional geometric structure that renders the stability constraint effective.
What would settle it
Compute the generalization bound explicitly for a sparse network trained on spherical data whose patches have been replaced by unstructured random vectors of the same dimension; if the bound stays non-vacuous, the dependence on patch geometry is refuted.
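A minimal sketch of what this falsification test would probe, assuming a PCA-based notion of effective dimension as a stand-in for the paper's geometric measure (the paper's own quantity may differ): structured patches concentrate variance in few directions, while i.i.d. random replacements of the same dimension do not.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_dim(patches, threshold=0.95):
    """Number of principal components needed to capture `threshold`
    of the total variance of the patch collection (a PCA proxy for
    intrinsic dimension; hypothetical, not the paper's definition)."""
    centered = patches - patches.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    var = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var), threshold)) + 1

n, p, r = 2000, 25, 3  # number of patches, patch dimension, latent dimension

# Structured patches: low-rank signal plus small noise, a crude proxy
# for the low-dimensional structure of natural-image patches.
basis = rng.standard_normal((r, p))
structured = rng.standard_normal((n, r)) @ basis + 0.05 * rng.standard_normal((n, p))

# Unstructured control: i.i.d. random vectors of the same dimension,
# as in the proposed refutation experiment.
unstructured = rng.standard_normal((n, p))

print(effective_dim(structured))    # few components suffice
print(effective_dim(unstructured))  # close to the full patch dimension p
```

If the paper's bound were insensitive to this gap in effective dimension, the claimed dependence on patch geometry would indeed be refuted.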
original abstract
Gradient descent on overparameterized neural networks typically operates at the Edge of Stability (EoS), where the largest Hessian eigenvalue hovers around a step-size-dependent threshold. We study how sparse connectivity changes generalization below this threshold in two-layer ReLU networks. Prior results have shown that for fully-connected networks (FCNs), generalization guarantees in this regime degrade and become vacuous on high-dimensional spherical inputs. Our analysis reveals that sparse connectivity fundamentally alters this picture. Under sparse connectivity, the network processes a collection of low-dimensional patches rather than the full input vector, so the effective constraint imposed by the stability condition is governed by the geometry of the training patch collection. We prove that when the receptive fields are small relative to the ambient dimension, the effective constraint yields non-vacuous generalization bounds in precisely the spherical regime where FCNs provably fail. The same framework also reveals a contrasting failure mode: if the patch collection lacks geometric structure, the constraint becomes unable to prevent overfitting. We corroborate this theory by analyzing the patch geometry of natural images, showing that standard convolutional designs produce patch multisets with low-dimensional structure that facilitates generalization. This provides a principled explanation for the generalization advantage of convolutional networks. Thus, our analysis yields a unified framework that identifies how architecture, data geometry, and gradient descent jointly govern generalization performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that in two-layer ReLU networks trained by gradient descent below the Edge of Stability, sparse connectivity (as in convolutional networks) yields non-vacuous generalization bounds by processing low-dimensional patches whose geometry imposes an effective stability constraint. This holds when receptive fields are small relative to ambient dimension and the training patch multiset has low-dimensional structure, in contrast to fully-connected networks on high-dimensional spherical inputs where bounds are vacuous. The authors derive this via a framework linking architecture, patch geometry, and stability, identify a failure mode when patches lack structure, and support the theory with analysis of natural-image patch geometry.
Significance. If the central claim holds with verifiable non-vacuous bounds, the work supplies a principled mechanism explaining the generalization advantage of convolutional over fully-connected networks below EoS. It unifies architecture sparsity, data geometry, and optimization dynamics, showing how the stability condition reduces effective complexity via patch multiset properties. This could guide architecture choices for high-dimensional data and highlights conditions under which sparsity fails to help.
major comments (3)
- [Abstract / main theorem] Abstract and proof sketch (main theorem section): the derivation that small receptive fields plus patch geometry produce a stability constraint with Rademacher complexity o(1) in the spherical regime lacks explicit error bounds, verification of the central reduction steps, and confirmation that the bound is quantitatively non-vacuous rather than merely qualitatively improved.
- [Empirical analysis of patch geometry] Empirical corroboration section: the analysis of natural-image patch geometry does not report the numerical value of the resulting generalization bound or the precise scaling of effective dimension for standard datasets such as CIFAR-10, preventing verification that the bound is actually non-vacuous in the regime where FCN bounds fail.
- [Assumptions / framework setup] Assumptions on patch structure: the claim that the training patch collection possesses 'sufficient low-dimensional geometric structure' to make the stability constraint effective requires a quantitative definition or measure of sufficiency, as the current formulation leaves the condition under which the bound becomes non-vacuous imprecise.
minor comments (2)
- [Notation / definitions] Notation for the patch multiset and its geometric quantities (e.g., covering numbers) could be introduced with a small concrete example or diagram to improve readability.
- [Related work] Ensure the discussion of prior FCN results below EoS includes all relevant citations for the vacuous-bound regime.
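The minor comment on notation could be met with a toy illustration. The following is a hedged sketch, assuming only the generic notion of an ε-covering of normalized patches (the greedy net and the smooth "ramp image" are our illustrative inventions, not the paper's construction): patches of a smooth signal lie near a low-dimensional curve on the sphere and need few covering centers, while random patches of the same dimension need many.

```python
import numpy as np

rng = np.random.default_rng(1)

def covering_number(points, eps):
    """Greedy upper estimate of the eps-covering number of a finite
    point set: repeatedly take an uncovered point as a new center."""
    centers = 0
    uncovered = list(range(len(points)))
    while uncovered:
        c = points[uncovered[0]]
        centers += 1
        uncovered = [i for i in uncovered
                     if np.linalg.norm(points[i] - c) > eps]
    return centers

# Toy patch multiset: 3-entry patches from a smooth 1-D ramp "image",
# normalized to the unit sphere as in spherical-input setups.
image = np.linspace(0.0, 1.0, 64)
patches = np.array([image[i:i + 3] for i in range(len(image) - 2)])
patches /= np.linalg.norm(patches, axis=1, keepdims=True)

# Unstructured control: the same number of i.i.d. directions on the sphere.
random_patches = rng.standard_normal((len(patches), 3))
random_patches /= np.linalg.norm(random_patches, axis=1, keepdims=True)

print(covering_number(patches, 0.2))         # few centers: structured
print(covering_number(random_patches, 0.2))  # many centers: unstructured
```

The same contrast, run on real image patches versus random vectors, is what the requested concrete example in the notation section could show.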
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested clarifications and additions.
point-by-point responses
Referee: [Abstract / main theorem] Abstract and proof sketch (main theorem section): the derivation that small receptive fields plus patch geometry produce a stability constraint with Rademacher complexity o(1) in the spherical regime lacks explicit error bounds, verification of the central reduction steps, and confirmation that the bound is quantitatively non-vacuous rather than merely qualitatively improved.
Authors: We agree that the main theorem section would benefit from greater explicitness. In the revision we will expand the proof sketch to include explicit error bounds and constants for each reduction step, and add a dedicated appendix containing the complete derivation. This will confirm that the Rademacher complexity is o(1) and that the resulting generalization bound is quantitatively non-vacuous (rather than only qualitatively improved) in the spherical regime. revision: yes
Referee: [Empirical analysis of patch geometry] Empirical corroboration section: the analysis of natural-image patch geometry does not report the numerical value of the resulting generalization bound or the precise scaling of effective dimension for standard datasets such as CIFAR-10, preventing verification that the bound is actually non-vacuous in the regime where FCN bounds fail.
Authors: We will revise the empirical corroboration section to report the explicit numerical values of the generalization bounds obtained from the patch-geometry analysis on CIFAR-10 (and other standard datasets), together with the precise scaling of the effective dimension. These additions will enable direct verification that the bounds are non-vacuous precisely where the corresponding FCN bounds become vacuous. revision: yes
Referee: [Assumptions / framework setup] Assumptions on patch structure: the claim that the training patch collection possesses 'sufficient low-dimensional geometric structure' to make the stability constraint effective requires a quantitative definition or measure of sufficiency, as the current formulation leaves the condition under which the bound becomes non-vacuous imprecise.
Authors: We agree that the notion of sufficiency requires a quantitative formulation. In the revised manuscript we will introduce a precise measure of low-dimensional structure (based on the covering number / intrinsic dimension of the patch multiset) and state the explicit condition (effective dimension scaling sufficiently slower than ambient dimension) under which the stability constraint yields a non-vacuous bound. revision: yes
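A condition of the promised form can be read off the scaling quoted under Theorem 4.2 later in this page. The restatement below is a hedged sketch with symbols as quoted there (m the effective dimension of the patch multiset, d the ambient dimension; A, J, M parameters of the theorem), not the authors' final formulation:

```latex
% m = effective (intrinsic) dimension of the patch multiset,
% d = ambient dimension; exponent as quoted in Theorem 4.2.
\[
  \text{gen.\ gap} \;\lesssim_d\; \mathrm{poly}(d, A, J, M)\,
  n^{-\frac{(d-m)(d+3)}{2\,(3d^2 - md + 3d - 3m)}},
  \qquad \text{non-vacuous whenever } m < \frac{d(d-3)}{d+3}.
\]
```

The exponent is negative exactly in this regime, so "effective dimension scaling sufficiently slower than ambient dimension" becomes the explicit inequality on m.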
Circularity Check
No significant circularity; the derivation is self-contained, proceeding from the stability condition applied to the patch geometry.
full rationale
The central derivation applies the Edge-of-Stability constraint to the multiset of low-dimensional patches induced by sparse receptive fields, then bounds the Rademacher complexity (or covering numbers) of the resulting function class directly from the geometric properties of that patch collection. This step is stated as a mathematical implication under the assumption that receptive-field size is small relative to ambient dimension; the bound is therefore obtained from the input geometry rather than from any parameter fitted inside the paper or from a self-citation chain. The subsequent empirical check that natural-image patches possess the required low-dimensional structure is presented only as corroboration and does not enter the proof. No self-definitional, fitted-input, or uniqueness-imported steps appear in the load-bearing chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Two-layer ReLU networks trained with gradient descent below the edge of stability
- domain assumption: Inputs lie on the high-dimensional sphere
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Theorem 4.1: K Σ_k |v_k| ‖w_k‖ g_{D,S}(w_k/‖w_k‖, b_k/‖w_k‖) ≤ 1/η − 1/2 + (R+1)√(2L(θ)), where g involves P(uᵀX_S > t)² · E[…] · √(1 + ‖E‖²)
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (D=3 forcing) · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Theorem 4.2: generalization gap ≲_d poly(d, A, J, M) · n^{−(d−m)(d+3)/(2(3d² − md + 3d − 3m))} for Uniform(S^{d−1}) when m < d(d−3)/(d+3)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.