Generalization Below the Edge of Stability: The Role of Data Geometry
Pith reviewed 2026-05-18 05:13 UTC · model grok-4.3
The pith
Data geometry controls whether gradient descent below the edge of stability generalizes or memorizes in two-layer ReLU networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For data supported on a mixture of low-dimensional balls, the paper derives generalization bounds that provably adapt to the intrinsic dimension. For a family of isotropic distributions that differ in how strongly probability mass concentrates toward the unit sphere, it derives a spectrum of bounds that worsen as concentration increases. Both results instantiate one principle: data that is harder to shatter with respect to ReLU activation thresholds leads gradient descent to learn shared patterns and therefore to generalize, whereas data that is easily shattered, such as data supported on the sphere, leads it to memorize.
What carries the argument
The difficulty of shattering the data distribution with respect to the activation thresholds of the ReLU neurons, which governs whether gradient descent selects shared-pattern representations or memorizing solutions.
If this is right
- Generalization bounds adapt to the intrinsic dimension when data lies on mixtures of low-dimensional balls.
- Generalization rates worsen continuously as isotropic data concentrates more mass on the sphere.
- The shattering property with ReLU thresholds predicts when gradient descent will favor solutions that capture shared patterns across examples.
- The same geometric principle accounts for previously observed differences between structured and unstructured data in empirical studies of overparameterized networks.
Where Pith is reading between the lines
- Measuring how easily a real dataset can be shattered by random hyperplanes near the origin may give a practical diagnostic for whether a given architecture and optimizer will generalize well on that data.
- The same shattering lens could be applied to deeper networks or to activations other than ReLU to test whether the principle extends beyond the two-layer case.
- Datasets engineered to increase or decrease shattering difficulty could be used to create controlled benchmarks that isolate the effect of data geometry on implicit bias.
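One way to make the first point above concrete is a crude isolation check. The sketch below is a proxy built on stated assumptions, not the paper's shatterability measure: it swaps random hyperplanes for data-aligned ones and asks, for each sample x_i, whether a ReLU neuron with direction x_i/‖x_i‖ and a threshold just below ‖x_i‖ would fire on x_i alone. The names `sample_isotropic` and `isolation_fraction`, the concentration knob `alpha`, and the radial law r = U^(1/alpha) are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_isotropic(n, d, alpha, rng):
    """Isotropic points in the unit ball with radius r = U**(1/alpha).

    alpha is a hypothetical concentration knob (not the paper's parameter):
    alpha = d gives the uniform ball, alpha = np.inf puts all mass on the sphere.
    """
    directions = rng.standard_normal((n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = rng.uniform(size=(n, 1)) ** (1.0 / alpha)
    return radii * directions

def isolation_fraction(x):
    """Fraction of points x_i that a data-aligned ReLU threshold can isolate."""
    gram = x @ x.T                        # gram[i, j] = <x_i, x_j>
    sq_norms = np.diag(gram).copy()       # ||x_i||^2
    # x_j lies in the half-space cut at x_i along direction x_i
    # exactly when <x_i, x_j> >= ||x_i||^2.
    covered = gram >= sq_norms[:, None]
    np.fill_diagonal(covered, False)      # a point does not cover itself
    return float(np.mean(~covered.any(axis=1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 3, 1000
    for alpha in (float(d), 30.0, np.inf):
        x = sample_isotropic(n, d, alpha, rng)
        print(f"alpha={alpha}: isolation fraction = {isolation_fraction(x):.3f}")
```

On the sphere (`alpha = np.inf`) every distinct point passes the check, because ⟨x_i, x_j⟩ < 1 whenever x_j ≠ x_i; for the uniform ball only points close to the boundary do. The example uses d = 3 because in high dimension the uniform ball already places most of its mass near the sphere, which would blur the contrast.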
Load-bearing premise
The analysis assumes that training dynamics remain below the edge of stability throughout optimization for the two-layer ReLU networks on the families of distributions considered.
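For readers unfamiliar with the term, the edge of stability refers to the threshold sharpness = 2/η, where η is the step size and sharpness is the largest Hessian eigenvalue of the loss. The toy below is only an illustration on a one-dimensional quadratic, not the paper's setting: it shows why gradient descent contracts when the sharpness is below 2/η and diverges above it. The function name `gd_on_quadratic` and all numeric values are hypothetical.

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return |x| after `steps`."""
    x = x0
    for _ in range(steps):
        # The gradient of f is sharpness * x, so each step multiplies x
        # by (1 - lr * sharpness); this shrinks |x| iff sharpness < 2 / lr.
        x = x - lr * sharpness * x
    return abs(x)

if __name__ == "__main__":
    lr = 0.1
    threshold = 2.0 / lr  # edge-of-stability sharpness for this step size
    for factor in (0.5, 0.99, 1.01):
        final = gd_on_quadratic(factor * threshold, lr)
        print(f"sharpness = {factor:.2f} * (2/lr): |x| after 50 steps = {final:.4g}")
```

In a real network the sharpness changes during training, so whether a run stays below 2/η has to be monitored rather than assumed, which is why the premise above is load-bearing.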
What would settle it
Train a two-layer ReLU network below the edge of stability on data supported exactly on the unit sphere. If test error stays low, rather than rising to the level expected under memorization, the predicted bias toward memorization is contradicted.
Original abstract
Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias. This paper presents theoretical results for overparametrized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: When the data is harder to "shatter" with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. On the other hand, for data that is easily shattered (e.g., data supported on the sphere) gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents theoretical analysis of generalization in overparameterized two-layer ReLU networks trained with gradient descent below the edge of stability. For data distributions supported on mixtures of low-dimensional balls, it derives generalization bounds that adapt to the intrinsic dimension. For isotropic distributions with varying degrees of concentration towards the unit sphere, it provides a spectrum of bounds that worsen as the mass concentrates on the sphere. These instantiate the principle that data harder to shatter with respect to ReLU activation thresholds leads to representations capturing shared patterns and good generalization, whereas easily shattered data (e.g., supported on the sphere) favors memorization.
Significance. If the derivations hold, this work offers a unifying theoretical framework connecting data geometry to the implicit bias of gradient descent (GD) below the edge of stability (EOS), consolidating disparate empirical findings on generalization. The explicit bounds for concrete data families and the parameter-free character of the core claims (as indicated by the axiom ledger) are strengths.
major comments (2)
- [§4] §4 (isotropic family): The spectrum of bounds is derived under the standing assumption that training remains below the edge of stability. For distributions with strong spherical concentration this assumption is load-bearing; if progressive sharpening pushes the sharpness above 2/η, the implicit-bias analysis and the claimed deterioration of rates no longer apply.
- [§3] §3 (mixture-of-balls case): The adaptation of the generalization bound to intrinsic dimension is stated to follow from the shattering properties of the data with respect to ReLU thresholds. The manuscript should make explicit how the mixture radii and centers enter the shatterability measure and the final bound, so that the dependence on intrinsic dimension is verifiable rather than implicit.
minor comments (2)
- [Abstract] Abstract: a one-sentence indication of the concrete rates (e.g., dependence on intrinsic dimension or concentration parameter) would help readers assess the strength of the results at a glance.
- [Notation] Notation section: ensure that the symbols for the concentration parameter and the intrinsic dimension are introduced once and used uniformly in both families of distributions.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated.
- Referee: [§4] §4 (isotropic family): The spectrum of bounds is derived under the standing assumption that training remains below the edge of stability. For distributions with strong spherical concentration this assumption is load-bearing; if progressive sharpening pushes the sharpness above 2/η, the implicit-bias analysis and the claimed deterioration of rates no longer apply.
  Authors: We agree that the results in §4 are derived under the explicit standing assumption that training dynamics remain below the edge of stability. This assumption is necessary for the implicit-bias characterization to apply, and we note that for distributions with strong spherical concentration the analysis is indeed conditional on this regime. If progressive sharpening were to drive the sharpness above 2/η, the dynamics would enter a different regime outside the scope of the current bounds. We will add a clarifying remark in §4 emphasizing the conditional nature of the deterioration in generalization rates and referencing the relevant literature on sharpness evolution.
  Revision: partial
- Referee: [§3] §3 (mixture-of-balls case): The adaptation of the generalization bound to intrinsic dimension is stated to follow from the shattering properties of the data with respect to ReLU thresholds. The manuscript should make explicit how the mixture radii and centers enter the shatterability measure and the final bound, so that the dependence on intrinsic dimension is verifiable rather than implicit.
  Authors: We thank the referee for this suggestion. The shatterability measure is defined with respect to ReLU threshold functions, and the mixture radii control the local covering numbers (hence the effective dimension), while the centers determine the minimal separation that affects global shattering capacity. We will revise the relevant paragraphs in §3 to explicitly trace the dependence of the shatterability coefficient on the ball radii and centers, and show how these parameters propagate into the final generalization bound, rendering the adaptation to intrinsic dimension fully explicit and directly verifiable.
  Revision: yes
Circularity Check
No significant circularity; derivation is self-contained theoretical analysis.
The paper derives generalization bounds for overparameterized two-layer ReLU networks trained below the edge of stability, for data on mixtures of low-dimensional balls (adapting to intrinsic dimension) and isotropic distributions with varying spherical concentration (bounds deteriorating as mass concentrates on the sphere). These instantiate an interpretive unifying principle about shattering difficulty and implicit bias toward shared patterns versus memorization. No quoted equations or steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the results follow from analysis of dynamics and geometry under explicit assumptions. The derivation is independent and self-contained against the stated data families and below-EOS regime, with no evidence of renaming known results or smuggling ansatzes via citation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Data distributions belong to the family of mixtures of low-dimensional balls or isotropic distributions with controllable concentration toward the unit sphere.
- domain assumption: Gradient descent training of the two-layer ReLU networks remains below the edge of stability.