Learning One-hidden-layer neural networks via Provable Gradient Descent with Random Initialization

Shuhao Xia; Yuanming Shi

arxiv: 1907.06594 · v2 · pith:LRHCJJ4Bnew · submitted 2019-07-04 · 💻 cs.LG · cs.IT· math.IT· stat.ML

Learning One-hidden-layer neural networks via Provable Gradient Descent with Random Initialization

Shuhao Xia , Yuanming Shi This is my paper

Pith reviewed 2026-05-25 09:20 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.ITstat.ML

keywords one-hidden-layer neural networksquadratic activationsgradient descentrandom initializationnon-convex optimizationprovable convergenceunder-parameterized regimelinear convergence

0 comments

The pith

Gradient descent with random initialization enters a strongly convex and smooth region and converges linearly to the global optimum for one-hidden-layer quadratic neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the training of one-hidden-layer neural networks with quadratic activations in the under-parameterized regime. It demonstrates that gradient descent starting from random initialization quickly reaches a local region with strong convexity and smoothness properties. From there, the iterates converge to the global optimum at a linear rate. The approach achieves this with sample complexity that is near optimal, offering a provable method for a non-convex problem.

Core claim

For the non-convex neural networks training problem we reveal that the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converges to a globally optimal model at a linear rate with near-optimal sample complexity.

What carries the argument

The local region with strong convexity and smoothness reached by gradient descent iterates from random initialization.

Load-bearing premise

The proof relies on quadratic activation functions and the under-parameterized regime to guarantee the existence of a strongly convex and smooth local region reachable by random initialization.

What would settle it

A counterexample where gradient descent with random initialization does not enter the strongly convex region or fails to converge linearly for quadratic activations in the under-parameterized regime would disprove the claim.

read the original abstract

Although deep learning has shown its powerful performance in many applications, the mathematical principles behind neural networks are still mysterious. In this paper, we consider the problem of learning a one-hidden-layer neural network with quadratic activations. We focus on the under-parameterized regime where the number of hidden units is smaller than the dimension of the inputs. We shall propose to solve the problem via a provable gradient-based method with random initialization. For the non-convex neural networks training problem we reveal that the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converges to a globally optimal model at a linear rate with near-optimal sample complexity. We further corroborate our theoretical findings via various experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Abstract claims quick entry to a strongly convex region for GD on under-parameterized quadratic one-hidden-layer nets, but the proof steps are uncheckable.

read the letter

The headline result here is that random-initialization gradient descent on a one-hidden-layer network with quadratic activations, in the under-parameterized regime, enters a local region of strong convexity and smoothness after a few steps and then converges linearly to a global optimum at near-optimal sample complexity. That is the punchline from the abstract. The specific combination of quadratic activations plus the under-parameterized setting plus the claim that random init reliably reaches the nice region looks like the part that is new relative to earlier work on non-convex landscape analysis. Mentioning experiments that corroborate the theory is also a positive step; at least the authors tried to close the loop between analysis and practice. The central limitation is obvious and unavoidable: only the abstract is available, so the derivation that establishes both the quick entry into the convex region and the strong-convexity/smoothness parameters cannot be inspected. The argument rests on the quadratic activation allowing a particular local geometry and on the hidden-unit count being smaller than the input dimension; without seeing how those assumptions are used in the proof, it is impossible to judge whether the linear-rate claim holds or whether the sample complexity is truly near-optimal. The data assumptions and the precise initialization variance are also left unspecified in what we have. This work is aimed at readers who follow theoretical analyses of gradient methods on simplified neural-network models. Someone already working on non-convex to locally-convex transitions might want to see the details if they exist, but the current version does not give enough to evaluate the contribution. I would not send it to peer review on the basis of the abstract alone; the authors would need to supply a full manuscript with verifiable steps before it merits referee time.

Referee Report

1 major / 0 minor

Summary. The paper considers learning a one-hidden-layer neural network with quadratic activations in the under-parameterized regime (fewer hidden units than input dimension). It claims that gradient descent with random initialization enters a local region enjoying strong convexity and smoothness within a few iterations, after which it converges linearly to a globally optimal model at near-optimal sample complexity; the claims are said to be corroborated by experiments.

Significance. If the result holds, the work would be significant for providing a concrete mechanism (rapid entry into a strongly convex/smooth basin) that explains linear convergence of GD on a non-convex neural-network objective, together with near-optimal sample complexity, in a simplified but non-trivial setting.

major comments (1)

[Abstract] Abstract: the central claim that 'the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations' is load-bearing for both the linear-rate convergence and the 'provable' qualifier, yet the manuscript supplies no derivation, no statement of the required initialization variance or data-distribution assumptions, and no argument establishing that this region is reachable in few steps and is indeed strongly convex/smooth.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for their careful review and for identifying a key presentational issue with the central claim in the abstract. We address the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations' is load-bearing for both the linear-rate convergence and the 'provable' qualifier, yet the manuscript supplies no derivation, no statement of the required initialization variance or data-distribution assumptions, and no argument establishing that this region is reachable in few steps and is indeed strongly convex/smooth.

Authors: The referee correctly observes that the provided abstract contains no derivations, no explicit statements of initialization variance or data-distribution assumptions, and no argument showing that the iterates reach a strongly convex/smooth region in few steps. Because only the abstract is available, the manuscript as presented here indeed supplies none of these elements. We therefore agree that the abstract, standing alone, does not substantiate the load-bearing claim. To remedy this, we will revise the abstract to state the key assumptions on initialization variance and data distribution and to give a high-level indication of the argument establishing entry into the strongly convex region. revision: yes

standing simulated objections not resolved

The full manuscript body containing the actual derivations, proofs, and technical arguments is not provided, so we cannot reference or reproduce those details in this response.

Circularity Check

0 steps flagged

No circularity detectable; only abstract available

full rationale

Only the abstract is provided, which states the claim of GD entering a strongly convex/smooth region then converging linearly but supplies no equations, parameter fits, self-citations, or derivation steps. No load-bearing reduction to inputs by construction can be exhibited because no derivation chain exists in the text. This is the normal honest outcome when the paper is self-contained against external benchmarks and no internal circularity is visible.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract does not specify any free parameters or invented entities; relies on general optimization theory.

axioms (1)

domain assumption The loss landscape for quadratic activations in under-parameterized regime has a strongly convex region reachable by GD from random init.
This is the key premise extracted from the abstract for the convergence claim.

pith-pipeline@v0.9.0 · 5627 in / 1017 out tokens · 30587 ms · 2026-05-25T09:20:10.217596+00:00 · methodology

Learning One-hidden-layer neural networks via Provable Gradient Descent with Random Initialization

Core claim

What carries the argument

Load-bearing premise

What would settle it

discussion (0)