On the Theory of Continual Learning with Gradient Descent for Neural Networks

Arya Mazumdar; Avishek Ghosh; Hossein Taheri

arXiv: 2510.05573 · v2 · submitted 2025-10-07 · 📊 stat.ML · cs.IT · cs.LG · math.IT

Hossein Taheri, Avishek Ghosh, Arya Mazumdar

Pith reviewed 2026-05-18 09:37 UTC · model grok-4.3

classification 📊 stat.ML · cs.IT · cs.LG · math.IT

keywords continual learning · forgetting · gradient descent · neural networks · quadratic networks · generalization bounds · algorithmic stability

The pith

Gradient descent on one-hidden-layer quadratic networks produces explicit bounds on forgetting that depend on iterations, samples, tasks, and width.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to understand the forgetting that occurs when neural networks learn tasks one after another by focusing on a mathematically tractable case. It analyzes gradient descent training of one-hidden-layer quadratic networks on a sequence of datasets, each consisting of XOR clusters with means that are orthogonal across tasks. From a precise tracking of how the loss changes during training, the authors obtain explicit formulas that bound how much earlier task performance drops as training proceeds on later tasks. These bounds are then carried over to test data using ideas from algorithmic stability, producing guarantees on generalization error as well. A reader might care because these formulas reveal exactly which design choices, such as network width or training length, control the severity of forgetting.

Core claim

In the tractable yet representative setting of one-hidden-layer quadratic neural networks trained by gradient descent on a sequence of XOR-cluster datasets with Gaussian noise and orthogonal means, a tight characterization of the gradient descent dynamics for the training loss yields explicit bounds on the rate of train-time forgetting as functions of the number of iterations, sample size, number of tasks, and hidden-layer width; leveraging an algorithmic stability framework then produces corresponding bounds on the generalization gap and thus on test-time forgetting.
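The setting in the core claim can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not the authors' code: the means are scaled standard basis vectors (so they are orthogonal by construction), the outer weights `a` are fixed at ±1/√m, the loss is logistic, and train-time forgetting for task k is taken to be the task-k training loss at the end of all tasks minus the same loss right after task k was trained. All function names and default values are hypothetical.

```python
import math
import random

def make_task(k, n, d, radius=2.0, sigma=0.1, rng=random):
    """Assumed XOR-cluster task k: label +1 near +/-mu_plus, label -1 near +/-mu_minus."""
    data = []
    for _ in range(n):
        y = rng.choice([1, -1])
        axis = 2 * k if y == 1 else 2 * k + 1      # orthogonal means: scaled basis vectors
        x = [rng.gauss(0.0, sigma) for _ in range(d)]  # isotropic Gaussian noise
        x[axis] += rng.choice([1.0, -1.0]) * radius    # random sign gives the XOR structure
        data.append((x, y))
    return data

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def softplus(t):
    # Numerically stable log(1 + exp(t)).
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def predict(W, a, x):
    # One-hidden-layer quadratic network: f(x) = sum_j a_j * (w_j . x)^2.
    return sum(aj * dot(wj, x) ** 2 for aj, wj in zip(a, W))

def loss(W, a, data):
    # Average logistic loss (an assumed choice of training loss).
    return sum(softplus(-y * predict(W, a, x)) for x, y in data) / len(data)

def gd_step(W, a, data, lr):
    """One full-batch gradient descent step on the hidden weights W."""
    n, d = len(data), len(W[0])
    grads = [[0.0] * d for _ in W]
    for x, y in data:
        pre = [dot(wj, x) for wj in W]
        f = sum(aj * p * p for aj, p in zip(a, pre))
        g = -y * sigmoid(-y * f) / n               # d(loss)/d(f) for this sample
        for j, (aj, p) in enumerate(zip(a, pre)):
            c = g * 2.0 * aj * p                   # chain rule through (w_j . x)^2
            for i, xi in enumerate(x):
                grads[j][i] += c * xi
    return [[w - lr * gr for w, gr in zip(wj, gj)] for wj, gj in zip(W, grads)]

def run(K=3, n=40, d=8, m=8, T=100, lr=0.1, seed=0):
    """Train on K tasks in sequence; return per-task loss after own training and forgetting."""
    assert d >= 2 * K, "need two orthogonal directions per task"
    rng = random.Random(seed)
    tasks = [make_task(k, n, d, rng=rng) for k in range(K)]
    a = [(1.0 if j % 2 == 0 else -1.0) / math.sqrt(m) for j in range(m)]
    W = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(m)]
    loss_after_own = []
    for k in range(K):
        for _ in range(T):
            W = gd_step(W, a, tasks[k], lr)
        loss_after_own.append(loss(W, a, tasks[k]))
    final = [loss(W, a, t) for t in tasks]
    forgetting = [final[k] - loss_after_own[k] for k in range(K)]
    return loss_after_own, forgetting

if __name__ == "__main__":
    own, forget = run()
    for k, (l, f) in enumerate(zip(own, forget)):
        print(f"task {k}: loss after own training {l:.4f}, train-time forgetting {f:+.4f}")
```

Varying `T`, `n`, `K`, and `m` in `run` is the kind of sweep the claimed bounds speak to; the numbers this toy script prints are not results from the paper.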

What carries the argument

The tight characterization of gradient descent dynamics for the training loss on these quadratic networks, which produces explicit bounds on forgetting rates in terms of iterations, sample size, task number, and hidden width.

Load-bearing premise

The setting of one-hidden-layer quadratic networks on XOR-cluster datasets with orthogonal means is representative of broader continual learning phenomena in neural networks.

What would settle it

Measuring whether the observed rate of forgetting scales inversely with hidden-layer width, while holding iterations, samples, and task count fixed, would directly test the accuracy of the derived bounds.
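One way to run this check is to fit a slope on a log-log plot of forgetting against width: inverse scaling would show up as a slope near -1. The sketch below assumes forgetting has already been measured at several widths with everything else held fixed; the helper name `loglog_slope` is hypothetical, and the data at the bottom are synthetic placeholders, not measurements from the paper.

```python
import math

def loglog_slope(widths, forgetting):
    """Least-squares slope of log(forgetting) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(f) for f in forgetting]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic placeholder data that follows forgetting = c / width exactly.
widths = [16, 32, 64, 128, 256]
synthetic_forgetting = [0.8 / w for w in widths]
slope = loglog_slope(widths, synthetic_forgetting)
print(f"fitted log-log slope: {slope:.3f}")
```

With real measurements, a slope clearly different from -1 would count against an inverse-width dependence, although constants and lower-order terms in the bounds can blur the fit at small widths.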

Original abstract

Continual learning, the ability of a model to adapt to an ongoing sequence of tasks without forgetting earlier ones, is a central goal of artificial intelligence. To better understand its underlying mechanisms, we study the limitations of continual learning in a tractable yet representative setting. Specifically, we analyze one-hidden-layer quadratic neural networks trained by gradient descent on a sequence of XOR-cluster datasets with Gaussian noise, where different tasks correspond to clusters with orthogonal means. Our analysis is based on a tight characterization of gradient descent dynamics for the training loss, which yields explicit bounds on the rate of train-time forgetting as functions of the number of iterations, sample size, number of tasks, and hidden-layer width. We then leverage an algorithmic stability framework to bound the generalization gap, leading to corresponding guarantees on test-time forgetting. Together, our results provide the first closed-form guarantees for forgetting in continual learning with neural networks and show how key problem parameters jointly govern forgetting dynamics. Numerical experiments corroborate our theoretical results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper derives explicit forgetting bounds via GD dynamics and stability, but only after assuming orthogonal task means that decouple everything.

This paper works out closed-form bounds on forgetting for gradient descent in one-hidden-layer quadratic networks trained sequentially on XOR clusters whose means are orthogonal. The main payoff is that they track the training loss evolution tightly enough to write down explicit rates for train-time forgetting in terms of steps, samples per task, number of tasks, and hidden width, then lift those to test-time guarantees with an algorithmic-stability argument. The numerics line up with the formulas inside their setup. That combination of explicit expressions and a stability step is the concrete advance over the usual empirical or asymptotic treatments in the area.

The derivations look careful and the orthogonality is used openly to kill the cross-task interference terms, which is what makes the closed forms possible. The soft spot is exactly that reliance on orthogonality. Once the cluster means are allowed to have non-zero inner products, the gradient updates pick up extra coupling terms whose effect on forgetting cannot be bounded by the same simple functions of the parameters. The authors label the setting “tractable yet representative,” but the bounds are tied to the orthogonal case; relaxing it would require a different argument.

Readers who care about quantitative continual-learning theory or about stability-based generalization in simple models will get the most out of it. The math is grounded enough and the claims are specific enough that the paper should go to referees rather than be desk-rejected. I would send it out for review and ask the authors to clarify how far the orthogonality can be relaxed before the closed forms break.
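The role of orthogonality can be seen in a one-line gradient computation. The block below is a sketch under an assumed parametrization (fixed outer weights a_j, trained hidden weights w_j) and a noise-free idealization; it is not quoted from the paper.

```latex
% Assumed parametrization (not quoted from the paper): fixed outer weights a_j, trained w_j.
f(x; W) = \sum_{j=1}^{m} a_j \,(w_j^\top x)^2,
\qquad
\nabla_{w_j} f(x; W) = 2\, a_j \,(w_j^\top x)\, x .

% Noise-free sample from task 2, neuron lying in the span of the task-1 means:
x = \pm\mu_2^{\pm}, \quad
w_j \in \operatorname{span}\{\mu_1^{+}, \mu_1^{-}\}, \quad
\langle \mu_1^{\pm}, \mu_2^{\pm} \rangle = 0
\;\Longrightarrow\;
w_j^\top x = 0
\;\Longrightarrow\;
\nabla_{w_j} f(x; W) = 0 .
```

In this idealization a neuron that has aligned with task 1 is neither moved by task-2 samples nor contributes to their predictions. With Gaussian noise, x = μ + ξ, the inner product w_j^⊤ ξ is no longer zero, and with non-orthogonal means a term proportional to ⟨μ_1, μ_2⟩ survives; these are the coupling terms the letter refers to.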

Referee Report

2 major / 2 minor

Summary. The manuscript studies continual learning in a tractable setting of one-hidden-layer quadratic neural networks trained by gradient descent on a sequence of XOR-cluster datasets with orthogonal means. It derives a tight characterization of the GD training dynamics to obtain explicit bounds on the rate of train-time forgetting as functions of iterations, sample size, number of tasks, and hidden width. An algorithmic-stability argument is then used to bound the generalization gap and thereby obtain corresponding guarantees on test-time forgetting. The central claims are that these constitute the first closed-form guarantees for forgetting in neural-network continual learning and that the key problem parameters jointly govern the forgetting dynamics. Numerical experiments are presented to corroborate the theory.

Significance. If the derivations are correct, the work supplies the first explicit, closed-form expressions linking forgetting rates to concrete problem parameters (iterations, samples, tasks, width) in a neural-network continual-learning setting. The combination of exact GD dynamics analysis with algorithmic stability is a methodological strength, and the explicit parameter dependence could inform scaling laws and hyper-parameter choices. The orthogonal-means assumption enables the decoupling that yields closed form, but this also confines the immediate scope; the results therefore advance understanding within a controlled regime rather than providing general guarantees.

major comments (2)

[Abstract / setting description] Abstract and setting paragraph: the closed-form characterization of GD dynamics and the subsequent forgetting bounds are obtained by exploiting orthogonality of the task means, which causes cross-task gradient interference terms to vanish. The manuscript should clarify whether the joint-governance claim and the explicit functional dependence on the listed parameters survive when orthogonality is relaxed, because the skeptic note indicates that non-orthogonal means introduce additional cross terms whose effect on forgetting cannot be bounded by the same closed-form expressions.
[Stability argument for generalization] Algorithmic-stability section: the passage from train-time loss bounds to test-time forgetting guarantees via stability must make the dependence of the stability parameter on width, number of tasks, and sample size fully explicit. Without this, it is unclear whether the final test-time bounds remain tight or become vacuous when these parameters vary.
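For orientation, the generic shape of a stability-based passage from train-time to test-time forgetting can be written as follows. This is a sketch under assumptions: it uses the standard uniform-stability bound and an assumed definition of forgetting as a loss difference; the paper may use a different stability notion and different notation. The symbols \widehat{R}_k, R_k, \varepsilon_k, \varepsilon_K and F^{\mathrm{te}}_{k,K} are introduced here for illustration only.

```latex
% Uniform stability: S and S' differ in a single sample.
\sup_{z}\, \bigl| \ell(A(S); z) - \ell(A(S'); z) \bigr| \le \varepsilon
\;\Longrightarrow\;
\Bigl| \mathbb{E}_S \bigl[ R(A(S)) - \widehat{R}_S(A(S)) \bigr] \Bigr| \le \varepsilon .

% Assumed definitions: w_k is the iterate after training on task k,
% \widehat{R}_k and R_k are the train and test risks of task k.
F^{\mathrm{tr}}_{k,K} = \widehat{R}_k(w_K) - \widehat{R}_k(w_k),
\qquad
F^{\mathrm{te}}_{k,K} = R_k(w_K) - R_k(w_k).

% Exact decomposition, then the triangle inequality in expectation:
F^{\mathrm{te}}_{k,K}
= F^{\mathrm{tr}}_{k,K}
+ \bigl[ R_k(w_K) - \widehat{R}_k(w_K) \bigr]
- \bigl[ R_k(w_k) - \widehat{R}_k(w_k) \bigr],
\qquad
\bigl| \mathbb{E}\, F^{\mathrm{te}}_{k,K} \bigr|
\le \bigl| \mathbb{E}\, F^{\mathrm{tr}}_{k,K} \bigr| + \varepsilon_K + \varepsilon_k .
```

The referee's point is that \varepsilon_k and \varepsilon_K are themselves functions of width, number of tasks, and sample size, so the final bound is only informative once that dependence is written out.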

minor comments (2)

[Experimental setup] The description of the XOR-cluster data generation (means, noise variance, cluster sizes) should appear in a dedicated subsection with explicit equations so that the numerical experiments can be reproduced exactly.
[Notation] Notation for the per-task loss and the forgetting metric should be introduced once and used consistently; several places appear to switch between L_t and F_t without re-definition.
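As an illustration of the kind of explicit statement the first minor comment asks for, one common way to write an XOR-cluster model for task k is given below. This is an assumed form, not the paper's construction: the noise level σ, the dimension d, and the equal mixing weights are placeholders.

```latex
% Assumed data model for task k (illustrative; \sigma and d are not fixed by the text above).
y \sim \mathrm{Unif}\{+1,-1\}, \qquad
s \sim \mathrm{Unif}\{+1,-1\}, \qquad
\xi \sim \mathcal{N}(0, \sigma^2 I_d),
\\
x = s\,\mu_k^{+} + \xi \ \text{ if } y = +1, \qquad
x = s\,\mu_k^{-} + \xi \ \text{ if } y = -1,
\\
\langle \mu_k^{a}, \mu_{k'}^{b} \rangle = 0 \quad \text{whenever } (k,a) \neq (k',b).
```

Each class is a symmetric pair of Gaussian clusters, so no linear classifier separates the labels, while a quadratic feature such as (w^⊤ x)^2 can.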

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of our results. We address each major comment below and indicate the planned revisions.

Referee: [Abstract / setting description] Abstract and setting paragraph: the closed-form characterization of GD dynamics and the subsequent forgetting bounds are obtained by exploiting orthogonality of the task means, which causes cross-task gradient interference terms to vanish. The manuscript should clarify whether the joint-governance claim and the explicit functional dependence on the listed parameters survive when orthogonality is relaxed, because the skeptic note indicates that non-orthogonal means introduce additional cross terms whose effect on forgetting cannot be bounded by the same closed-form expressions.

Authors: We agree that the orthogonality of task means is essential for the closed-form results, as it causes the cross-task gradient interference terms to vanish and enables the decoupling used in the GD dynamics analysis. The joint-governance claim and explicit functional dependence on iterations, samples, tasks, and width are derived specifically under this assumption. Relaxing orthogonality would introduce additional cross terms that prevent the same closed-form bounds. We will revise the abstract and setting description to explicitly note that the results hold in the orthogonal-means regime chosen for tractability, and that the current guarantees do not directly extend to the non-orthogonal case without further analysis.
Revision: yes

Referee: [Stability argument for generalization] Algorithmic-stability section: the passage from train-time loss bounds to test-time forgetting guarantees via stability must make the dependence of the stability parameter on width, number of tasks, and sample size fully explicit. Without this, it is unclear whether the final test-time bounds remain tight or become vacuous when these parameters vary.

Authors: We appreciate this observation. While the stability parameter depends on width, number of tasks, and sample size through the underlying loss bounds and generalization analysis, these dependencies are not stated explicitly in the final test-time expressions. We will revise the algorithmic-stability section to make the scaling with these parameters fully explicit, ensuring the test-time forgetting bounds remain informative and non-vacuous as the parameters vary.
Revision: yes

Circularity Check

0 steps flagged

Derivation self-contained from GD dynamics under explicit orthogonality assumption

The paper derives explicit forgetting bounds from a direct analysis of gradient descent on one-hidden-layer quadratic networks trained on XOR-cluster data whose means are orthogonal by construction. Orthogonality causes cross-task gradient terms to vanish, decoupling the per-task loss trajectories and permitting closed-form expressions in terms of iteration count, sample size, number of tasks, and width; the subsequent algorithmic-stability argument for test-time forgetting inherits the same decoupled dynamics. No equation reduces a fitted parameter to a prediction, no result is obtained by renaming a known empirical pattern, and no load-bearing step rests on a self-citation whose content is itself unverified. The central claim therefore follows from first-principles stability analysis within the stated modeling assumptions rather than from any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only: the analysis assumes the quadratic network and orthogonal XOR-cluster construction capture essential continual-learning dynamics; no explicit free parameters or invented entities are named in the provided text.

axioms (1)

domain assumption: One-hidden-layer quadratic networks on XOR-cluster tasks with orthogonal means form a tractable yet representative setting for continual learning.
Stated in abstract as the basis for the analysis.

pith-pipeline@v0.9.0 · 5706 in / 1275 out tokens · 23926 ms · 2026-05-18T09:37:52.505358+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

We assume that all $\mu_k^{+}$ and $\mu_k^{-}$ are mutually orthogonal for all $k \in [K]$ … Although our analysis can be extended to the more general case … this is beyond the scope of the present work.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Theorem 1 (Train-time forgetting) … $|F^{\mathrm{tr}}_{k,K}| = \tilde{O}\bigl(\eta T \sqrt{K-k}/(d\sqrt{n}) + \dots\bigr)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.