Generalization Below the Edge of Stability: The Role of Data Geometry

Alexander Cloninger; Rahul Parhi; Tongtong Liang; Yu-Xiang Wang

arxiv: 2510.18120 · v3 · submitted 2025-10-20 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-18 05:13 UTC · model grok-4.3

keywords generalization bounds · data geometry · ReLU networks · edge of stability · implicit bias · overparameterized networks · shattering · low-dimensional data

The pith

Data geometry controls whether gradient descent below the edge of stability generalizes or memorizes in two-layer ReLU networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that for overparameterized two-layer ReLU networks trained below the edge of stability, the geometry of the training distribution determines the implicit bias of gradient descent. When data lies on mixtures of low-dimensional balls, the derived generalization bounds adapt to the intrinsic dimension of those balls. For isotropic distributions, bounds of the same style deteriorate gradually as more probability mass concentrates toward the unit sphere. The unifying mechanism is how easily the data can be shattered by the activation thresholds of the ReLU neurons: distributions that resist such shattering push the network toward representations that capture shared structure across examples and therefore generalize, while easily shattered distributions allow memorization. These results give a geometric account of why certain data sets produce good generalization even in heavily overparameterized regimes.

Core claim

For data supported on a mixture of low-dimensional balls the paper derives generalization bounds that provably adapt to intrinsic dimension; for isotropic distributions whose probability mass varies in concentration toward the unit sphere it derives a spectrum of bounds that worsen as concentration increases. These instantiate the principle that data harder to shatter with respect to ReLU activation thresholds induces gradient descent to learn shared patterns and therefore to generalize, whereas data that is easily shattered, such as data supported on the sphere, induces memorization.
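The two data families named in the core claim can be made concrete with a small sampler. This is a minimal sketch under stated assumptions, not the paper's construction: the function names, the ball radius and center scale, and the radial law `U^(1/concentration)` for the isotropic family are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_unit_directions(rng, n, d):
    """Draw n directions uniformly from the unit sphere in R^d."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def sample_mixture_of_balls(rng, n, ambient_dim, intrinsic_dim, n_balls, radius=0.2):
    """Points drawn uniformly from k-dimensional balls embedded in R^D.

    Each ball gets a random center of norm 0.5 and a random orthonormal
    k-frame (both are assumptions made for this illustration).
    """
    centers = 0.5 * sample_unit_directions(rng, n_balls, ambient_dim)
    frames = [np.linalg.qr(rng.standard_normal((ambient_dim, intrinsic_dim)))[0]
              for _ in range(n_balls)]
    labels = rng.integers(0, n_balls, size=n)
    # Uniform in a k-ball: uniform direction scaled by radius * U^(1/k).
    dirs = sample_unit_directions(rng, n, intrinsic_dim)
    radii = radius * rng.random(n) ** (1.0 / intrinsic_dim)
    local = dirs * radii[:, None]
    points = np.stack([centers[j] + frames[j] @ z for j, z in zip(labels, local)])
    return points, labels

def sample_isotropic(rng, n, d, concentration):
    """Isotropic points in the unit ball with radius U^(1/concentration).

    concentration = d gives the uniform distribution on the ball; larger
    values push mass toward the unit sphere (assumed parameterization).
    """
    radii = rng.random(n) ** (1.0 / concentration)
    return sample_unit_directions(rng, n, d) * radii[:, None]

rng = np.random.default_rng(0)
X_balls, _ = sample_mixture_of_balls(rng, n=500, ambient_dim=20, intrinsic_dim=2, n_balls=3)
X_iso = sample_isotropic(rng, n=500, d=20, concentration=200.0)
print(X_balls.shape, X_iso.shape)
```

Sweeping `concentration` from `d` upward gives one way to move between a uniform ball and data that sits almost on the sphere, which is the axis along which the review says the bounds degrade.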

What carries the argument

The difficulty of shattering the data distribution with respect to the activation thresholds of the ReLU neurons, which governs whether gradient descent selects shared-pattern representations or memorizing solutions.

If this is right

Generalization bounds adapt to the intrinsic dimension when data lies on mixtures of low-dimensional balls.
Generalization rates worsen continuously as isotropic data concentrates more mass on the sphere.
The shattering property with ReLU thresholds predicts when gradient descent will favor solutions that capture shared patterns across examples.
The same geometric principle accounts for previously observed differences between structured and unstructured data in empirical studies of overparameterized networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Measuring how easily a real dataset can be shattered by random hyperplanes near the origin may give a practical diagnostic for whether a given architecture and optimizer will generalize well on that data.
The same shattering lens could be applied to deeper networks or to activations other than ReLU to test whether the principle extends beyond the two-layer case.
Datasets engineered to increase or decrease shattering difficulty could be used to create controlled benchmarks that isolate the effect of data geometry on implicit bias.
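The diagnostic idea in the first extension above can be sketched as follows. The proxy used here is an assumption of this sketch and is not taken from the paper: it counts the fraction of data points that a single thresholded ReLU neuron can activate alone, estimated from random unit directions. The function name `isolatable_fraction` and all sizes are hypothetical.

```python
import numpy as np

def unit_rows(a):
    """Normalize each row to unit Euclidean norm."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)

def isolatable_fraction(X, n_directions=5000, seed=0):
    """Fraction of points that some sampled neuron can activate alone.

    For a unit direction w, the point with the largest projection w.x is
    the only active input of relu(w.x - b) whenever b lies between the
    largest and second-largest projections. Counting distinct argmax
    points over random directions therefore lower-bounds how many points
    one thresholded ReLU can isolate (a hypothetical proxy).
    """
    rng = np.random.default_rng(seed)
    W = unit_rows(rng.standard_normal((n_directions, X.shape[1])))
    winners = np.argmax(W @ X.T, axis=1)  # index of the top point per direction
    return np.unique(winners).size / X.shape[0]

rng = np.random.default_rng(1)
n, d = 200, 10
sphere = unit_rows(rng.standard_normal((n, d)))       # all mass on the unit sphere
disc2d = unit_rows(rng.standard_normal((n, 2))) * np.sqrt(rng.random((n, 1)))
low_dim = np.hstack([disc2d, np.zeros((n, d - 2))])   # uniform 2-D disc embedded in R^d

frac_sphere = isolatable_fraction(sphere)
frac_low_dim = isolatable_fraction(low_dim)
print(f"sphere: {frac_sphere:.2f}, low-dimensional disc: {frac_low_dim:.2f}")
```

Only points on the convex hull can win an argmax, so the proxy is large when every point is extreme (sphere data) and small when most points lie in the interior of a low-dimensional set. Whether this tracks the shattering notion used in the paper's bounds is untested here.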

Load-bearing premise

The analysis assumes that training dynamics remain below the edge of stability throughout optimization for the two-layer ReLU networks on the families of distributions considered.
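For readers unfamiliar with the threshold behind this premise, the condition "sharpness below 2/η" can be illustrated on a toy quadratic, where sharpness is simply the largest Hessian eigenvalue. This is a sketch on an assumed three-dimensional quadratic, not on the two-layer ReLU loss the paper analyzes; `run_gd` and the matrix `H` are hypothetical.

```python
import numpy as np

def run_gd(hessian, x0, eta, steps=200):
    """Gradient descent on f(x) = 0.5 * x^T H x; returns the final iterate."""
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * (hessian @ x)  # gradient of the quadratic is H x
    return x

H = np.diag([1.0, 4.0, 10.0])            # toy Hessian; its sharpness is 10
sharpness = np.linalg.eigvalsh(H).max()
x0 = np.ones(3)

for eta in (0.15, 0.25):
    below_eos = sharpness < 2.0 / eta
    final_norm = np.linalg.norm(run_gd(H, x0, eta))
    print(f"eta={eta}: 2/eta={2 / eta:.1f}, below EoS={below_eos}, |x_T|={final_norm:.3e}")
```

Along each eigendirection the iterate is multiplied by `1 - eta * lambda` per step, so it contracts exactly when `lambda < 2 / eta`. With `eta = 0.15` every eigenvalue satisfies this and the iterates shrink; with `eta = 0.25` the top eigenvalue violates it and the iterates blow up.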

What would settle it

Training a two-layer ReLU network below the edge of stability on data supported exactly on the unit sphere and observing that test error remains low rather than rising to the level expected for memorization would contradict the predicted favoritism toward memorization.
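A minimal sketch of such an experiment is given below, under loudly stated assumptions: the linear target, the noise level, the width, the initialization, and the step size are all hypothetical, and the script does not verify that sharpness stays below 2/η during training (the step size is merely chosen small). It shows the shape of the test, not a result.

```python
import numpy as np

def init_params(rng, d, width):
    W = rng.standard_normal((width, d))
    b = np.zeros(width)                          # activation thresholds
    v = rng.standard_normal(width) / np.sqrt(width)
    return W, b, v

def forward(params, X):
    W, b, v = params
    pre = X @ W.T - b                            # (n, width) pre-activations
    return np.maximum(pre, 0.0) @ v, pre

def mse(params, X, y):
    return 0.5 * np.mean((forward(params, X)[0] - y) ** 2)

def gd_step(params, X, y, eta):
    """One full-batch gradient descent step on the half mean squared error."""
    W, b, v = params
    pred, pre = forward(params, X)
    resid = (pred - y) / len(y)                  # dL/dpred
    gate = (pre > 0).astype(float)               # ReLU derivative
    grad_v = np.maximum(pre, 0.0).T @ resid
    back = resid[:, None] * gate * v[None, :]    # dL/dpre
    grad_W = back.T @ X
    grad_b = -back.sum(axis=0)
    return W - eta * grad_W, b - eta * grad_b, v - eta * grad_v

rng = np.random.default_rng(0)
d, width, n_train, n_test = 5, 100, 100, 1000
eta, steps, noise = 0.01, 2000, 0.5              # hypothetical settings

def sphere(n):
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def target(X):
    return X[:, 0]                               # assumed ground-truth function

X_train, X_test = sphere(n_train), sphere(n_test)
y_train = target(X_train) + noise * rng.standard_normal(n_train)

params = init_params(rng, d, width)
initial_loss = mse(params, X_train, y_train)
for _ in range(steps):
    params = gd_step(params, X_train, y_train, eta)
train_loss = mse(params, X_train, y_train)
test_loss = mse(params, X_test, target(X_test))  # error against the clean target
print(f"train: {initial_loss:.3f} -> {train_loss:.3f}, clean test: {test_loss:.3f}")
```

A faithful version of the test would additionally track the largest Hessian eigenvalue against 2/η throughout training and would repeat the run on data drawn from the interior of the ball for comparison.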

Original abstract

Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias. This paper presents theoretical results for overparametrized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: When the data is harder to "shatter" with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. On the other hand, for data that is easily shattered (e.g., data supported on the sphere) gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper links data geometry to generalization via new bounds for ReLU nets below the edge of stability, but the below-EOS assumption may not hold for sphere-concentrated data.

The punchline is that this paper gives a theoretical account of how data geometry affects generalization in two-layer ReLU networks when trained below the edge of stability. They derive bounds that adapt to the intrinsic dimension for data on mixtures of low-dimensional balls, and a spectrum of bounds for isotropic data that get worse as probability mass concentrates on the sphere. This backs up their claim that data harder to shatter with ReLU thresholds leads to shared pattern representations and good generalization, while sphere-supported data leads to memorization.

What stands out as new is these particular bounds and the way they organize the role of geometry. The paper does well in trying to explain disparate empirical findings with a single principle based on shatterability.

The soft spots are around the assumption that training stays below the edge of stability. The stress-test concern is valid here: for strongly concentrated spherical data, the sharpness might grow and cross the threshold, so the analysis wouldn't cover those cases. If the paper doesn't show that the assumption holds or give conditions for it, the results for the isotropic family are limited in scope. The citation pattern and the focus on data geometry rather than post-training quantities seem fine, with no obvious circularity. The math is presented as derivations from the data distributions, which is good if the full proofs hold up.

This is the kind of paper for people studying theoretical aspects of overparameterized networks and implicit regularization. It could be of interest to those thinking about how to choose or design datasets for better generalization. I think it should go to peer review. The contributions are specific and the unifying principle is worth discussing, so a referee can help tighten the assumptions and check the derivations.

Referee Report

2 major / 2 minor

Summary. The manuscript presents theoretical analysis of generalization in overparameterized two-layer ReLU networks trained with gradient descent below the edge of stability. For data distributions supported on mixtures of low-dimensional balls, it derives generalization bounds that adapt to the intrinsic dimension. For isotropic distributions with varying degrees of concentration towards the unit sphere, it provides a spectrum of bounds that worsen as the mass concentrates on the sphere. These instantiate the principle that data harder to shatter with respect to ReLU activation thresholds leads to representations capturing shared patterns and good generalization, whereas easily shattered data (e.g., supported on the sphere) favors memorization.

Significance. If the derivations hold, this work offers a unifying theoretical framework connecting data geometry to the implicit bias of GD below the EOS, consolidating disparate empirical findings on generalization. The explicit bounds for concrete data families and the parameter-free character of the core claims (as indicated by the axiom ledger) are strengths.

major comments (2)

[§4] §4 (isotropic family): The spectrum of bounds is derived under the standing assumption that training remains below the edge of stability. For distributions with strong spherical concentration this assumption is load-bearing; if progressive sharpening pushes the sharpness above 2/η, the implicit-bias analysis and the claimed deterioration of rates no longer apply.
[§3] §3 (mixture-of-balls case): The adaptation of the generalization bound to intrinsic dimension is stated to follow from the shattering properties of the data with respect to ReLU thresholds. The manuscript should make explicit how the mixture radii and centers enter the shatterability measure and the final bound, so that the dependence on intrinsic dimension is verifiable rather than implicit.

minor comments (2)

[Abstract] Abstract: a one-sentence indication of the concrete rates (e.g., dependence on intrinsic dimension or concentration parameter) would help readers assess the strength of the results at a glance.
[Notation] Notation section: ensure that the symbols for the concentration parameter and the intrinsic dimension are introduced once and used uniformly in both families of distributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated.

Referee: [§4] §4 (isotropic family): The spectrum of bounds is derived under the standing assumption that training remains below the edge of stability. For distributions with strong spherical concentration this assumption is load-bearing; if progressive sharpening pushes the sharpness above 2/η, the implicit-bias analysis and the claimed deterioration of rates no longer apply.

Authors: We agree that the results in §4 are derived under the explicit standing assumption that training dynamics remain below the edge of stability. This assumption is necessary for the implicit-bias characterization to apply, and we note that for distributions with strong spherical concentration the analysis is indeed conditional on this regime. If progressive sharpening were to drive the sharpness above 2/η, the dynamics would enter a different regime outside the scope of the current bounds. We will add a clarifying remark in §4 emphasizing the conditional nature of the deterioration in generalization rates and referencing the relevant literature on sharpness evolution.

Revision: partial

Referee: [§3] §3 (mixture-of-balls case): The adaptation of the generalization bound to intrinsic dimension is stated to follow from the shattering properties of the data with respect to ReLU thresholds. The manuscript should make explicit how the mixture radii and centers enter the shatterability measure and the final bound, so that the dependence on intrinsic dimension is verifiable rather than implicit.

Authors: We thank the referee for this suggestion. The shatterability measure is defined with respect to ReLU threshold functions, and the mixture radii control the local covering numbers (hence the effective dimension), while the centers determine the minimal separation that affects global shattering capacity. We will revise the relevant paragraphs in §3 to explicitly trace the dependence of the shatterability coefficient on the ball radii and centers, and show how these parameters propagate into the final generalization bound, rendering the adaptation to intrinsic dimension fully explicit and directly verifiable.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained theoretical analysis.

full rationale

The paper derives generalization bounds for overparameterized two-layer ReLU networks trained below the edge of stability, for data on mixtures of low-dimensional balls (adapting to intrinsic dimension) and isotropic distributions with varying spherical concentration (bounds deteriorating as mass concentrates on the sphere). These instantiate an interpretive unifying principle about shattering difficulty and implicit bias toward shared patterns versus memorization. No quoted equations or steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the results follow from analysis of dynamics and geometry under explicit assumptions. The derivation is independent and self-contained against the stated data families and below-EOS regime, with no evidence of renaming known results or smuggling ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about the form of the data distributions and the training regime; no free parameters are introduced or fitted because the work derives bounds rather than fits models to data.

axioms (2)

Domain assumption: Data distributions belong to the family of mixtures of low-dimensional balls or isotropic distributions with controllable concentration toward the unit sphere.
These are the concrete settings for which the adaptive and spectrum bounds are stated.

Domain assumption: Gradient descent training of the two-layer ReLU networks remains below the edge of stability.
This regime is required for the implicit bias toward shared patterns versus memorization to hold as described.
