Optimal L2 Regularization in High-dimensional Continual Linear Regression

Daniel Soudry; Edward Moroshko; Gilad Karpel; Itay Evron; Ran Levinstein; Ron Meir

arxiv: 2601.13844 · v2 · submitted 2026-01-20 · 💻 cs.LG

Pith reviewed 2026-05-16 12:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords: continual learning · linear regression · L2 regularization · high-dimensional asymptotics · generalization error · scaling laws · overparameterized models · task sequences

The pith

The optimal fixed L2 regularization strength in high-dimensional continual linear regression scales as T over ln T with the number of tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines generalization when an overparameterized linear model learns a sequence of regression tasks one by one using a single fixed isotropic L2 regularizer. It derives an exact closed-form expression for the expected test error that holds in the high-dimensional limit for any sequence of linear target functions. The analysis shows that this regularization reduces the effect of label noise whether the targets come from one teacher or from multiple independent ones. The central result is that the regularization coefficient minimizing the error grows nearly linearly with task count T, specifically in proportion to T divided by the natural logarithm of T. Experiments on both linear models and neural networks confirm the predicted scaling and its impact on performance.

Core claim

In the high-dimensional regime, the expected generalization loss of continual linear regression under fixed isotropic L2 regularization admits a closed-form expression valid for arbitrary linear teachers. Under i.i.d. teachers, minimizing this loss with respect to the regularization coefficient yields an optimal strength that scales asymptotically as T / ln T, where T denotes the number of tasks. The same regularizer mitigates label noise in both single-teacher and multiple i.i.d. teacher settings without requiring storage of past data.
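A minimal sketch of the training protocol this claim refers to may help fix ideas. It assumes, as an illustration rather than a quotation of the paper, that each task solves a ridge problem anchored at the previous iterate with penalty weight `lam * d`. The function names `continual_ridge` and `make_tasks`, the Gaussian features, and the single-teacher setup are all hypothetical choices.

```python
import numpy as np

def continual_ridge(tasks, lam, w0):
    """Fit tasks one by one with a fixed isotropic L2 penalty.

    Assumed objective per task (not quoted from the paper):
        min_w ||X w - y||^2 + lam * d * ||w - w_prev||^2
    Only the previous iterate is stored; past data are discarded.
    """
    w = np.asarray(w0, dtype=float).copy()
    d = w.shape[0]
    for X, y in tasks:
        A = X.T @ X + lam * d * np.eye(d)   # regularized Gram matrix
        b = X.T @ y + lam * d * w           # pull toward the previous iterate
        w = np.linalg.solve(A, b)
    return w

def make_tasks(T, n, d, teacher, noise_std, rng):
    """Draw T tasks with standard Gaussian features and noisy linear labels."""
    tasks = []
    for _ in range(T):
        X = rng.standard_normal((n, d))
        y = X @ teacher + noise_std * rng.standard_normal(n)
        tasks.append((X, y))
    return tasks
```

In this sketch a very large `lam` freezes the model at `w0`, while a very small `lam` lets each task overwrite the previous solution, so label noise from the latest task dominates. The fixed coefficient discussed above sits between these two extremes.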

What carries the argument

The closed-form expression for expected generalization loss obtained under high-dimensional asymptotics, which is then minimized to obtain the optimal regularization coefficient.
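The recursion behind such a closed form can be sketched as follows. This is a reconstruction under an assumed objective (a ridge step anchored at the previous iterate with weight $\lambda d$), not a quotation of the paper's theorem; here $z_t$ denotes label noise and $w_t^\star$ the teacher of task $t$.

```latex
\begin{align}
w_t &= \arg\min_{w}\; \|X_t w - y_t\|^2 + \lambda d\,\|w - w_{t-1}\|^2
     = \left(X_t^\top X_t + \lambda d\, I\right)^{-1}
       \left(X_t^\top y_t + \lambda d\, w_{t-1}\right), \\
P_t &\triangleq \lambda d \left(X_t^\top X_t + \lambda d\, I\right)^{-1},
     \qquad y_t = X_t w_t^\star + z_t, \\
w_t - w_t^\star &= P_t\left(w_{t-1} - w_t^\star\right)
     + \frac{1}{\lambda d}\, P_t X_t^\top z_t .
\end{align}
```

Unrolling the last line expresses the final error as products $P_t \cdots P_k$ applied to the initial error, to the differences between consecutive teachers, and to the noise terms. Under this assumed setup, the expected squared error then depends on the data only through moments such as $\mathbb{E}[P]$ and $\mathbb{E}[P^2]$, which is what makes a closed form plausible.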

If this is right

Isotropic L2 regularization alone suffices to control label noise across multiple teachers without memory of past tasks.
The T / ln T rule supplies a concrete schedule for choosing regularization strength as new tasks arrive.
The derived scaling governs generalization behavior in both exact linear regression and in trained neural networks.
The closed-form loss applies to arbitrary linear teachers rather than restricted families of targets; the T / ln T optimum is derived for i.i.d. teachers.
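As a rough illustration of the T / ln T rule, the helper below computes a regularization strength for a given task horizon. The function name `lambda_for_horizon` and the constant `c` are hypothetical: the rule is asymptotic and fixes the strength only up to a problem-dependent factor that would have to be calibrated separately.

```python
import math

def lambda_for_horizon(T, c=1.0):
    """Return c * T / ln(T) for a horizon of T >= 2 tasks.

    c is a problem-dependent constant (hypothetical here); the scaling
    law only describes growth with T, not the absolute value.
    """
    if T < 2:
        raise ValueError("T / ln T is undefined for T < 2")
    return c * T / math.log(T)

# Print the uncalibrated rule for a few horizons.
for T in (4, 16, 64, 256):
    print(T, lambda_for_horizon(T))
```

Note that T / ln T is not monotone for very small T (it is slightly larger at T = 2 than at T = 3), so the rule is best read as guidance for longer task sequences.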

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar scaling might appear in non-linear continual settings if high-dimensional approximations remain valid.
Practitioners could adjust the regularization coefficient upward with each new task according to the observed count.
The scaling offers a possible lens for analyzing forgetting in sequential training through effective regularization strength.
The approach may connect to other sequential estimation problems where ridge penalties accumulate over data streams.

Load-bearing premise

The closed-form expression holds exactly only in the high-dimensional limit (there, for arbitrary linear teachers), and the T / ln T scaling law additionally relies on the i.i.d. teacher assumption under which it is derived.

What would settle it

A high-dimensional simulation with increasing task count T in which the empirically optimal regularization strength deviates from proportionality to T / ln T would falsify the scaling claim.
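Such a check could be sketched as below. Everything specific here is an assumption made for illustration: the ridge step anchored at the previous iterate with weight `lam * d`, standard Gaussian features, teachers drawn i.i.d. around a unit-norm mean, the loss taken as the mean squared distance of the final iterate to all T teachers, and the small sizes `d` and `n`, which are far from a true high-dimensional limit.

```python
import numpy as np

def final_loss(T, lam, d, n, mean_teacher, teacher_std, noise_std, rng):
    """Train on T tasks with a fixed lam; return the mean squared distance
    of the final iterate to all T teachers (assumed loss definition)."""
    teachers = mean_teacher + teacher_std * rng.standard_normal((T, d))
    w = np.zeros(d)
    for t in range(T):
        X = rng.standard_normal((n, d))
        y = X @ teachers[t] + noise_std * rng.standard_normal(n)
        A = X.T @ X + lam * d * np.eye(d)
        w = np.linalg.solve(A, X.T @ y + lam * d * w)
    return float(np.mean(np.sum((w - teachers) ** 2, axis=1)))

def empirical_optimal_lambda(T, lambdas, d=40, n=20, seeds=5):
    """Grid-search the fixed lam that minimizes the seed-averaged loss."""
    mean_teacher = np.ones(d) / np.sqrt(d)  # hypothetical non-zero mean teacher
    losses = []
    for lam in lambdas:
        # Reuse the same seeds for every lam so grid points are comparable.
        runs = [final_loss(T, lam, d, n, mean_teacher, 0.1, 1.0,
                           np.random.default_rng(s)) for s in range(seeds)]
        losses.append(np.mean(runs))
    losses = np.array(losses)
    return lambdas[int(np.argmin(losses))], losses

lambdas = np.logspace(-3, 2, 12)
for T in (4, 16, 64):
    best, _ = empirical_optimal_lambda(T, lambdas)
    # T / ln T is printed only as a reference shape, up to a constant factor.
    print(T, best, T / np.log(T))
```

Because the predicted scaling holds only up to a constant, the informative comparison is the growth of the empirical optimum across horizons, for example the slope of a log-log fit against T / ln T, rather than the raw values. A coarse grid like this one can also mask the trend, so a finer grid and larger dimensions would be needed for a serious test.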

Figures

Figures reproduced from arXiv: 2601.13844 by Daniel Soudry, Edward Moroshko, Gilad Karpel, Itay Evron, Ran Levinstein, Ron Meir.

**Figure 1.** Regularization effects on synthetic data. Interactions between the number of tasks (T), the regularization strength (λ), and the resulting generalization loss (normalized by that of the trivial regressor w = 0). Here, the label-noise and feature variances are v_z = v_x = 1; each curve averages over 5 runs. While the unregularized scheme collapses under label noise, optimal regularization endures. The unregularized scheme solved …

**Figure 2.** Optimal regularization substantially im…

**Figure 3.** Regularization in linear models and neural networks (NN). We plot the held-out classification error after learning T noisy tasks. Curves are averaged over 5 seeds; stars mark minima. From Section 5.3 (Neural Network Experiments: Validation Beyond Linear Models): we also consider a two-layer ReLU neural network (784→500→1) with a sigmoid output, trained continually with L2 regularization. Again, noisy binary labels are g…

**Figure 4.** Regularization strength vs. task horizon. Stars mark the empirical optimal regularization strength obtained after training on a noisy sequence of T tasks and averaging over 5 random seeds. We also plot the analytical optima predicted by our Theorem 6, shown as a white curve. We observe a seemingly multiplicative mismatch between the empirical and analytical optima. Footnote 8: By normalization we mean per-feature…

**Figure 5.** Empirical optimal regularization strength vs. task horizon. We plot the empirical optimal strength after learning a noisy sequence with T tasks, averaged over 5 seeds, and compare it to the optimum predicted by our analysis. From Section B.3 (Aspect-Ratio Scaling Predicted by Theory Extends Beyond i.i.d. Features): we evaluate a linear model trained on the original non-whitened MNIST-based data, but vary the aspect ratio n…

**Figure 6.** Effect of aspect ratio on optimal regularization strength. We plot the empirical optimal regularizer after learning a noisy sequence with T tasks. Each curve reports the average empirical and theoretical values over 5 random seeds.

Original abstract

We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks $T$, specifically as $T/\ln T$. To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives a closed-form expected generalization loss for high-dimensional continual linear regression with fixed isotropic L2 regularization that works for arbitrary linear teachers, and shows the optimal fixed lambda scales as T/ln T under i.i.d. teachers.

The paper's main result is a closed-form expression for the expected generalization loss in overparameterized continual linear regression under fixed isotropic L2 regularization. It also proves that the best fixed regularization strength scales as T / ln T when teachers are i.i.d. Both are new: prior theory that accommodates multiple teachers either skipped regularization or relied on memory-demanding methods.

The closed form gives a clean way to see how regularization trades off bias against the label noise that builds up over tasks in the high-dimensional limit. The experiments on linear regression and small neural nets confirm that the scaling affects performance as expected.

One soft spot is the reach of the scaling law. The loss formula holds for arbitrary linear teachers, but the T / ln T optimum is derived for the i.i.d. teacher case. When teachers have different norms or covariances, the accumulated noise term no longer factors the same way, so the argmin over lambda can shift. The paper should flag this boundary explicitly. The high-dimensional asymptotic setting is another limit; the neural net experiments are helpful but do not fully close the gap to finite dimensions.

This work is for people studying theoretical continual learning and scaling laws for regularization in sequential training. Readers who want exact expressions or a simple recipe for lambda in linear continual settings will get value from it. It deserves a serious referee because the closed form is a real theoretical step, even if the scaling needs tighter qualification on teacher statistics. I would send it to review.

Referee Report

1 major / 1 minor

Summary. The paper studies generalization in overparameterized continual linear regression with isotropic L2 regularization applied across a sequence of tasks. It derives a closed-form expression for expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. The work shows that this regularization mitigates label noise for both single-teacher and multiple i.i.d. teacher settings, proves that the optimal fixed regularization strength scales as T/ln T, and validates the results via experiments on linear regression and neural networks, offering a practical recipe for continual learning systems.

Significance. If the closed-form derivation and the T/ln T scaling hold, the paper provides the first explicit theoretical result on optimal regularization scaling in continual learning. The closed-form expression for arbitrary teachers and the experimental bridge to neural networks would be valuable for understanding bias-variance tradeoffs in sequential high-dimensional settings.

major comments (1)

[Abstract and scaling law section] Abstract and scaling derivation: the closed-form expected loss is stated to hold for arbitrary linear teachers, but the proof that optimal fixed lambda scales as T/ln T is presented immediately after referencing the multiple i.i.d. teacher setting. For heterogeneous teachers (varying norms or covariances), the accumulated noise term does not factor uniformly, so the argmin over lambda can deviate from T/ln T; this assumption must be clarified or the scaling extended to support the central claim.

minor comments (1)

[Experiments] Experiments section: the neural network validation should explicitly state the architecture, task sequence construction, and how the linear insights transfer to confirm the practical recipe.
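To make the requested level of detail concrete, here is a minimal NumPy sketch of a two-layer ReLU network with a sigmoid output, using the 784→500→1 shape quoted in the Figure 3 text. The initialization, the binary cross-entropy loss, and the form of the L2 penalty (squared distance to an `anchor`, which could be zeros for plain weight decay or the previous task's parameters) are assumptions for illustration, not the authors' stated setup.

```python
import numpy as np

def init_params(rng, d_in=784, d_hidden=500):
    """Random initialization with variance scaled by fan-in (assumed)."""
    return {
        "W1": rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_in),
        "b1": np.zeros(d_hidden),
        "W2": rng.standard_normal((d_hidden, 1)) / np.sqrt(d_hidden),
        "b2": np.zeros(1),
    }

def forward(params, X):
    """Two-layer ReLU network with a sigmoid output; returns probabilities."""
    h = np.maximum(X @ params["W1"] + params["b1"], 0.0)
    logits = (h @ params["W2"] + params["b2"]).ravel()
    return 1.0 / (1.0 + np.exp(-logits))

def regularized_loss(params, anchor, X, y, lam):
    """Binary cross-entropy plus lam times the squared distance to anchor.

    Which anchor the paper uses is not stated here; both options are
    hypothetical readings of 'trained continually with L2 regularization'.
    """
    p = np.clip(forward(params, X), 1e-12, 1 - 1e-12)
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    penalty = sum(np.sum((params[k] - anchor[k]) ** 2) for k in params)
    return bce + lam * penalty
```

Stating the anchor, the optimizer, and how each task's noisy labels are generated would be enough for a reader to judge how closely the network experiments mirror the linear analysis.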

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comment below and will incorporate the necessary clarification in the revised manuscript.

Referee: [Abstract and scaling law section] Abstract and scaling derivation: the closed-form expected loss is stated to hold for arbitrary linear teachers, but the proof that optimal fixed lambda scales as T/ln T is presented immediately after referencing the multiple i.i.d. teacher setting. For heterogeneous teachers (varying norms or covariances), the accumulated noise term does not factor uniformly, so the argmin over lambda can deviate from T/ln T; this assumption must be clarified or the scaling extended to support the central claim.

Authors: We agree that the T/ln T scaling derivation relies on the i.i.d. teacher assumption, where the accumulated noise term factors uniformly. The closed-form expected generalization loss itself is derived for arbitrary linear teachers and does not require this assumption. We will revise the abstract and the scaling-law section to explicitly state that the optimal fixed regularization scaling of T/ln T holds under the multiple i.i.d. teacher setting, while for heterogeneous teachers (varying norms or covariances) the optimal lambda may deviate and depend on the specific teacher parameters. This clarification preserves the generality of the closed-form result while accurately delimiting the scope of the scaling law.

Revision: yes

Circularity Check

0 steps flagged

Derivation remains self-contained; scaling obtained from first-principles analysis of closed-form loss

The paper derives a closed-form expected generalization loss for arbitrary linear teachers in the high-dimensional regime, then obtains the T/ln T scaling for optimal lambda by direct minimization of that expression. No step reduces a prediction to a fitted parameter defined on the same data, nor does any load-bearing claim rest solely on a self-citation whose content is unverified. The derivation is therefore independent of the target result and receives only a minor self-citation penalty.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard high-dimensional asymptotic assumptions for linear regression and the continual training protocol with fixed isotropic regularization; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: high-dimensional regime limit for linear regression with arbitrary teachers. Invoked to obtain the closed-form expected generalization loss.

pith-pipeline@v0.9.0 · 5462 in / 1202 out tokens · 26371 ms · 2026-05-16T12:51:57.944658+00:00 · methodology
