pith. sign in

arxiv: 1907.06594 · v2 · pith:LRHCJJ4Bnew · submitted 2019-07-04 · 💻 cs.LG · cs.IT· math.IT· stat.ML

Learning One-hidden-layer neural networks via Provable Gradient Descent with Random Initialization

Pith reviewed 2026-05-25 09:20 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.ITstat.ML
keywords one-hidden-layer neural networksquadratic activationsgradient descentrandom initializationnon-convex optimizationprovable convergenceunder-parameterized regimelinear convergence
0
0 comments X

The pith

Gradient descent with random initialization enters a strongly convex and smooth region and converges linearly to the global optimum for one-hidden-layer quadratic neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the training of one-hidden-layer neural networks with quadratic activations in the under-parameterized regime. It demonstrates that gradient descent starting from random initialization quickly reaches a local region with strong convexity and smoothness properties. From there, the iterates converge to the global optimum at a linear rate. The approach achieves this with sample complexity that is near optimal, offering a provable method for a non-convex problem.

Core claim

For the non-convex neural networks training problem we reveal that the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converges to a globally optimal model at a linear rate with near-optimal sample complexity.

What carries the argument

The local region with strong convexity and smoothness reached by gradient descent iterates from random initialization.

Load-bearing premise

The proof relies on quadratic activation functions and the under-parameterized regime to guarantee the existence of a strongly convex and smooth local region reachable by random initialization.

What would settle it

A counterexample where gradient descent with random initialization does not enter the strongly convex region or fails to converge linearly for quadratic activations in the under-parameterized regime would disprove the claim.

read the original abstract

Although deep learning has shown its powerful performance in many applications, the mathematical principles behind neural networks are still mysterious. In this paper, we consider the problem of learning a one-hidden-layer neural network with quadratic activations. We focus on the under-parameterized regime where the number of hidden units is smaller than the dimension of the inputs. We shall propose to solve the problem via a provable gradient-based method with random initialization. For the non-convex neural networks training problem we reveal that the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converges to a globally optimal model at a linear rate with near-optimal sample complexity. We further corroborate our theoretical findings via various experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper considers learning a one-hidden-layer neural network with quadratic activations in the under-parameterized regime (fewer hidden units than input dimension). It claims that gradient descent with random initialization enters a local region enjoying strong convexity and smoothness within a few iterations, after which it converges linearly to a globally optimal model at near-optimal sample complexity; the claims are said to be corroborated by experiments.

Significance. If the result holds, the work would be significant for providing a concrete mechanism (rapid entry into a strongly convex/smooth basin) that explains linear convergence of GD on a non-convex neural-network objective, together with near-optimal sample complexity, in a simplified but non-trivial setting.

major comments (1)
  1. [Abstract] Abstract: the central claim that 'the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations' is load-bearing for both the linear-rate convergence and the 'provable' qualifier, yet the manuscript supplies no derivation, no statement of the required initialization variance or data-distribution assumptions, and no argument establishing that this region is reachable in few steps and is indeed strongly convex/smooth.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for their careful review and for identifying a key presentational issue with the central claim in the abstract. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations' is load-bearing for both the linear-rate convergence and the 'provable' qualifier, yet the manuscript supplies no derivation, no statement of the required initialization variance or data-distribution assumptions, and no argument establishing that this region is reachable in few steps and is indeed strongly convex/smooth.

    Authors: The referee correctly observes that the provided abstract contains no derivations, no explicit statements of initialization variance or data-distribution assumptions, and no argument showing that the iterates reach a strongly convex/smooth region in few steps. Because only the abstract is available, the manuscript as presented here indeed supplies none of these elements. We therefore agree that the abstract, standing alone, does not substantiate the load-bearing claim. To remedy this, we will revise the abstract to state the key assumptions on initialization variance and data distribution and to give a high-level indication of the argument establishing entry into the strongly convex region. revision: yes

standing simulated objections not resolved
  • The full manuscript body containing the actual derivations, proofs, and technical arguments is not provided, so we cannot reference or reproduce those details in this response.

Circularity Check

0 steps flagged

No circularity detectable; only abstract available

full rationale

Only the abstract is provided, which states the claim of GD entering a strongly convex/smooth region then converging linearly but supplies no equations, parameter fits, self-citations, or derivation steps. No load-bearing reduction to inputs by construction can be exhibited because no derivation chain exists in the text. This is the normal honest outcome when the paper is self-contained against external benchmarks and no internal circularity is visible.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract does not specify any free parameters or invented entities; relies on general optimization theory.

axioms (1)
  • domain assumption The loss landscape for quadratic activations in under-parameterized regime has a strongly convex region reachable by GD from random init.
    This is the key premise extracted from the abstract for the convergence claim.

pith-pipeline@v0.9.0 · 5627 in / 1017 out tokens · 30587 ms · 2026-05-25T09:20:10.217596+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.