Surfing: Iterative optimization over incrementally trained deep networks

Ganlin Song; John Lafferty; Zhou Fan

arxiv: 1907.08653 · v1 · pith:WNYNCNFQnew · submitted 2019-07-19 · 📊 stat.ML · cs.LG

Surfing: Iterative optimization over incrementally trained deep networks

Ganlin Song , Zhou Fan , John Lafferty This is my paper

Pith reviewed 2026-05-24 19:03 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords surfingiterative optimizationdeep networksempirical riskcompressed sensingstochastic gradient descentglobal optimizationexpansive networks

0 comments

The pith

Optimizing a sequence of risk functions from networks at successive training stages reaches the global minimum of the final nonconvex objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a sequential optimization method that minimizes the empirical risk using network parameters obtained at different stages of stochastic gradient descent training. It begins with random initial parameters, for which the risk function is smooth and amenable to gradient descent, then incrementally tracks the optimum as the network parameters update and the risk surface gradually becomes more nonconvex. This procedure, termed surfing, is formalized and analyzed for expansive networks. Experiments demonstrate that it locates global optima and solves compressed sensing problems in cases where direct gradient descent applied to the final trained network fails to do so.

Core claim

The central claim is that the surfing procedure, by sequentially optimizing the sequence of risk functions f_theta_t(x) where theta_t denotes network parameters at successive stages of training, locates the global minimum of the final empirical risk f_hat theta(x) for certain families of deep networks. Because each stochastic gradient descent step changes the parameters only modestly, the risk surface evolves slowly enough that an optimizer can ride along the changing peak rather than restarting from scratch on the final wavy surface. The method is shown to succeed on global optimization and compressed sensing tasks even when standard gradient descent on the completed network does not.

What carries the argument

The surfing procedure, which incrementally optimizes the evolving empirical risk functions f_theta_t(x) as network parameters theta_t are updated during stochastic gradient descent training.

If this is right

Surfing recovers the global optimum of the final risk for expansive networks in regimes where direct optimization on the trained network fails.
The method enables compressed sensing by inverting the trained generative network through the incremental optimization path.
The slow evolution of the risk surface during training supplies a sequence of successively harder objectives that collectively guide the optimizer to the global solution.
Analysis for expansive networks establishes conditions under which the incremental updates remain effective throughout training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same incremental-tracking idea could be applied to other iterative training processes whose parameter trajectories change gradually.
Surfing might be combined with restarts or momentum adjustments to handle cases where parameter updates occasionally become large.
The approach suggests that the training trajectory itself can serve as a curriculum of optimization problems leading to the final objective.

Load-bearing premise

The parameters of the network do not change by very much in each step of stochastic gradient descent, allowing the risk surface to evolve slowly and be incrementally optimized.

What would settle it

A controlled experiment on an expansive network in which the global minimum of the final risk is known by exhaustive search, direct gradient descent on the final network misses it, and surfing also misses it even though parameter changes per step remain small.

read the original abstract

We investigate a sequential optimization procedure to minimize the empirical risk functional $f_{\hat\theta}(x) = \frac{1}{2}\|G_{\hat\theta}(x) - y\|^2$ for certain families of deep networks $G_{\theta}(x)$. The approach is to optimize a sequence of objective functions that use network parameters obtained during different stages of the training process. When initialized with random parameters $\theta_0$, we show that the objective $f_{\theta_0}(x)$ is "nice'' and easy to optimize with gradient descent. As learning is carried out, we obtain a sequence of generative networks $x \mapsto G_{\theta_t}(x)$ and associated risk functions $f_{\theta_t}(x)$, where $t$ indicates a stage of stochastic gradient descent during training. Since the parameters of the network do not change by very much in each step, the surface evolves slowly and can be incrementally optimized. The algorithm is formalized and analyzed for a family of expansive networks. We call the procedure {\it surfing} since it rides along the peak of the evolving (negative) empirical risk function, starting from a smooth surface at the beginning of learning and ending with a wavy nonconvex surface after learning is complete. Experiments show how surfing can be used to find the global optimum and for compressed sensing even when direct gradient descent on the final learned network fails.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Surfing is a sequential optimization along the SGD training path that works in some experiments where direct GD on the final network fails, backed by analysis limited to expansive networks.

read the letter

The main point is that this paper defines surfing as optimizing a sequence of risk functions f_θt while the network parameters θt are still moving under SGD. They start from random θ0 where the surface is smooth enough for plain gradient descent to succeed, then ride the slowly changing surface to the final nonconvex f_θ̂. The claim is that this reaches better points than optimizing the trained network directly, with supporting experiments on global optimization and compressed sensing tasks.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes 'surfing', a sequential optimization procedure that minimizes the empirical risk f_θ̂(x) = ½‖G_θ̂(x) − y‖² by optimizing a sequence of objectives f_θt(x) whose parameters θt are taken from successive stages of SGD training of a deep network G_θ. It claims that, for a family of expansive networks, the procedure can be formalized and analyzed because parameter changes are small enough that the surface evolves slowly and incremental gradient steps can track the moving minimizer; experiments are presented showing that surfing recovers global optima and solves compressed-sensing tasks in regimes where direct gradient descent on the final trained network fails.

Significance. If the analysis for expansive networks supplies a rigorous guarantee that the surfing trajectory reaches the global argmin of the final surface and the experimental successes are reproducible across network families, the result would be significant: it would demonstrate a practical route to non-convex optimization that exploits the training trajectory itself rather than treating the final loss landscape as a static black box.

major comments (2)

[Analysis for expansive networks] Analysis for expansive networks (the section formalizing the algorithm): the central claim that incremental gradient descent on the sequence f_θt tracks the global minimizer rests on the unquantified statement that 'parameters do not change by very much in each step.' No modulus of continuity on θt, bound on ‖∇x f_θt‖ drift, or theorem establishing that a fixed number of inner steps suffices to stay near the moving argmin across the entire training trajectory is supplied; without such control the experimental successes could be artifacts of particular initializations or schedules.
[Experiments] Experiments section (global-optimum and compressed-sensing tasks): the claim that surfing succeeds where direct GD on the final network fails is load-bearing for the practical contribution, yet the manuscript reports neither the number of incremental gradient steps taken per stage, the observed drift in the location of the minimizer between consecutive θt, nor ablations that isolate the surfing mechanism from the choice of expansive-network architecture.

minor comments (1)

[Abstract] The abstract describes f_θ0(x) as 'nice' without specifying the precise properties (strong convexity, smoothness constant, or absence of spurious local minima) that make it easy to optimize; a short quantitative statement would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments identify areas where the manuscript can be strengthened with additional quantitative detail. We address each point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: Analysis for expansive networks (the section formalizing the algorithm): the central claim that incremental gradient descent on the sequence f_θt tracks the global minimizer rests on the unquantified statement that 'parameters do not change by very much in each step.' No modulus of continuity on θt, bound on ‖∇x f_θt‖ drift, or theorem establishing that a fixed number of inner steps suffices to stay near the moving argmin across the entire training trajectory is supplied; without such control the experimental successes could be artifacts of particular initializations or schedules.

Authors: We agree that the existing analysis for expansive networks provides only a qualitative statement regarding slow parameter evolution and does not supply the requested quantitative controls. In the revised manuscript we will add an explicit modulus of continuity on the training trajectory θt, a bound on the resulting drift of ∇x f_θt, and a theorem that guarantees a fixed number of inner gradient steps suffices to remain near the moving argmin for the entire sequence. These additions will make the formal claim rigorous rather than heuristic. revision: yes
Referee: Experiments section (global-optimum and compressed-sensing tasks): the claim that surfing succeeds where direct GD on the final network fails is load-bearing for the practical contribution, yet the manuscript reports neither the number of incremental gradient steps taken per stage, the observed drift in the location of the minimizer between consecutive θt, nor ablations that isolate the surfing mechanism from the choice of expansive-network architecture.

Authors: We concur that these experimental details are necessary for reproducibility and for isolating the surfing mechanism. The revised version will report the exact number of inner gradient steps used at each stage, quantify the observed drift of the minimizer between consecutive θt, and include ablations that compare surfing against direct optimization on both expansive and non-expansive architectures. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation is self-contained algorithmic description

full rationale

The paper presents surfing as a sequential procedure that incrementally optimizes a sequence of risk functions f_θt(x) while network parameters evolve under SGD, starting from an initial easy-to-optimize surface at θ0. The formalization and analysis for expansive networks rests on the stated (but unquantified) premise that parameter changes per step are small enough for the surface to evolve slowly; this is an explicit modeling assumption rather than a quantity defined in terms of the output or a fitted parameter renamed as a prediction. No equations reduce the claimed global-optimum tracking to a tautology, no self-citation chain supplies a uniqueness theorem, and no ansatz is smuggled in. The experimental claims are presented as empirical outcomes on specific networks and tasks, not as derivations forced by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text required for complete ledger.

pith-pipeline@v0.9.0 · 5777 in / 888 out tokens · 17590 ms · 2026-05-24T19:03:32.736110+00:00 · methodology

Surfing: Iterative optimization over incrementally trained deep networks

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)