Learning One-hidden-layer neural networks via Provable Gradient Descent with Random Initialization
Pith reviewed 2026-05-25 09:20 UTC · model grok-4.3
The pith
Gradient descent with random initialization enters a strongly convex and smooth region and converges linearly to the global optimum for one-hidden-layer quadratic neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the non-convex neural networks training problem we reveal that the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converges to a globally optimal model at a linear rate with near-optimal sample complexity.
What carries the argument
The local region with strong convexity and smoothness reached by gradient descent iterates from random initialization.
Load-bearing premise
The proof relies on quadratic activation functions and the under-parameterized regime to guarantee the existence of a strongly convex and smooth local region reachable by random initialization.
What would settle it
A counterexample where gradient descent with random initialization does not enter the strongly convex region or fails to converge linearly for quadratic activations in the under-parameterized regime would disprove the claim.
read the original abstract
Although deep learning has shown its powerful performance in many applications, the mathematical principles behind neural networks are still mysterious. In this paper, we consider the problem of learning a one-hidden-layer neural network with quadratic activations. We focus on the under-parameterized regime where the number of hidden units is smaller than the dimension of the inputs. We shall propose to solve the problem via a provable gradient-based method with random initialization. For the non-convex neural networks training problem we reveal that the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converges to a globally optimal model at a linear rate with near-optimal sample complexity. We further corroborate our theoretical findings via various experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper considers learning a one-hidden-layer neural network with quadratic activations in the under-parameterized regime (fewer hidden units than input dimension). It claims that gradient descent with random initialization enters a local region enjoying strong convexity and smoothness within a few iterations, after which it converges linearly to a globally optimal model at near-optimal sample complexity; the claims are said to be corroborated by experiments.
Significance. If the result holds, the work would be significant for providing a concrete mechanism (rapid entry into a strongly convex/smooth basin) that explains linear convergence of GD on a non-convex neural-network objective, together with near-optimal sample complexity, in a simplified but non-trivial setting.
major comments (1)
- [Abstract] Abstract: the central claim that 'the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations' is load-bearing for both the linear-rate convergence and the 'provable' qualifier, yet the manuscript supplies no derivation, no statement of the required initialization variance or data-distribution assumptions, and no argument establishing that this region is reachable in few steps and is indeed strongly convex/smooth.
Simulated Author's Rebuttal
We thank the referee for their careful review and for identifying a key presentational issue with the central claim in the abstract. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations' is load-bearing for both the linear-rate convergence and the 'provable' qualifier, yet the manuscript supplies no derivation, no statement of the required initialization variance or data-distribution assumptions, and no argument establishing that this region is reachable in few steps and is indeed strongly convex/smooth.
Authors: The referee correctly observes that the provided abstract contains no derivations, no explicit statements of initialization variance or data-distribution assumptions, and no argument showing that the iterates reach a strongly convex/smooth region in few steps. Because only the abstract is available, the manuscript as presented here indeed supplies none of these elements. We therefore agree that the abstract, standing alone, does not substantiate the load-bearing claim. To remedy this, we will revise the abstract to state the key assumptions on initialization variance and data distribution and to give a high-level indication of the argument establishing entry into the strongly convex region. revision: yes
- The full manuscript body containing the actual derivations, proofs, and technical arguments is not provided, so we cannot reference or reproduce those details in this response.
Circularity Check
No circularity detectable; only abstract available
full rationale
Only the abstract is provided, which states the claim of GD entering a strongly convex/smooth region then converging linearly but supplies no equations, parameter fits, self-citations, or derivation steps. No load-bearing reduction to inputs by construction can be exhibited because no derivation chain exists in the text. This is the normal honest outcome when the paper is self-contained against external benchmarks and no internal circularity is visible.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The loss landscape for quadratic activations in under-parameterized regime has a strongly convex region reachable by GD from random init.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.