Optimal L2 Regularization in High-dimensional Continual Linear Regression
Pith reviewed 2026-05-16 12:51 UTC · model grok-4.3
The pith
The optimal fixed L2 regularization strength in high-dimensional continual linear regression scales as T over ln T with the number of tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the high-dimensional regime, the expected generalization loss of continual linear regression under fixed isotropic L2 regularization admits a closed-form expression valid for arbitrary linear teachers. Minimizing this loss with respect to the regularization coefficient yields an optimal strength that scales asymptotically as T / ln T, where T denotes the number of tasks. The same regularizer mitigates label noise in both single-teacher and multiple i.i.d. teacher settings without requiring storage of past data.
What carries the argument
A closed-form expression for the expected generalization loss, obtained under high-dimensional asymptotics and then minimized over the regularization coefficient to find its optimal value.
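The structure of that argument can be written schematically as follows. This is a sketch under stated assumptions, not the paper's exact statement: it assumes each task t supplies data (X_t, y_t) in dimension d generated by a linear teacher w_t^\star with additive label noise, that each step solves a ridge problem anchored at the previous iterate, and that the penalty is normalized as \lambda d. The closed-form loss itself is not reproduced in this summary, so it appears only as \mathbb{E}[G_T(\lambda)].

```latex
% Assumed setup: y_t = X_t w_t^\star + z_t, with the penalty normalized as \lambda d.
\begin{align*}
w_t &= \arg\min_{w}\; \|X_t w - y_t\|^2 + \lambda d\,\|w - w_{t-1}\|^2
     = \left(X_t^\top X_t + \lambda d\, I\right)^{-1}
       \left(X_t^\top y_t + \lambda d\, w_{t-1}\right), \\
G_T(\lambda) &= \frac{1}{T}\sum_{i=1}^{T} \|w_T - w_i^\star\|^2, \\
\lambda^\star(T) &= \arg\min_{\lambda > 0}\; \mathbb{E}\!\left[G_T(\lambda)\right],
\qquad \lambda^\star(T) \text{ scales as } \frac{T}{\ln T} \text{ for large } T.
\end{align*}
```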
If this is right
- Isotropic L2 regularization alone suffices to control label noise across multiple teachers without memory of past tasks.
- The T / ln T rule supplies a concrete recipe for choosing the fixed regularization strength from the number of tasks T.
- The derived scaling is reflected in the generalization behavior of both linear regression and trained neural networks in the paper's experiments.
- The result applies to arbitrary linear teachers rather than restricted families of targets.
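As a minimal illustration of the T / ln T rule listed above, the helper below evaluates the asymptotic form for a given task count. The function name `fixed_l2_strength` and the multiplicative constant `scale` are hypothetical: the summary gives only the growth rate, so any proportionality constant would have to come from the paper's closed-form loss or from tuning.

```python
import math


def fixed_l2_strength(num_tasks: int, scale: float = 1.0) -> float:
    """Return scale * T / ln(T) for a task horizon T = num_tasks.

    `scale` is a hypothetical problem-dependent constant; only the
    T / ln T growth rate is taken from the summarized result.
    """
    if num_tasks < 2:
        # ln(1) = 0, so the expression is undefined for a single task.
        raise ValueError("num_tasks must be at least 2")
    return scale * num_tasks / math.log(num_tasks)


if __name__ == "__main__":
    # Print the suggested strength for a few task horizons.
    for horizon in (10, 100, 1000):
        print(horizon, round(fixed_l2_strength(horizon), 3))
```

The rule is asymptotic, so values for very small T should be read with caution: T / ln T is not even monotone between T = 2 and T = 3.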
Where Pith is reading between the lines
- Similar scaling might appear in non-linear continual settings if high-dimensional approximations remain valid.
- Practitioners could adjust the regularization coefficient upward with each new task according to the observed count.
- The scaling offers a possible lens for analyzing forgetting in sequential training through effective regularization strength.
- The approach may connect to other sequential estimation problems where ridge penalties accumulate over data streams.
Load-bearing premise
The closed-form expression and the resulting scaling law are exact only under the assumptions of the high-dimensional regime, where the expression covers arbitrary linear teachers.
What would settle it
A high-dimensional simulation with increasing task count T in which the empirically optimal regularization strength deviates from proportionality to T / ln T would falsify the scaling claim.
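A test of this kind could be sketched as below. This is a minimal sketch under stated assumptions, not the paper's experimental protocol: it assumes Gaussian features, i.i.d. teachers scattered around a nonzero mean, a ridge step anchored at the previous iterate with penalty `lam * dim`, and parameter error averaged over all teachers as the loss. The function names and all default values (`dim`, `samples_per_task`, `noise_std`, `teacher_spread`) are hypothetical.

```python
import numpy as np


def continual_ridge_loss(num_tasks, lam, dim=40, samples_per_task=20,
                         noise_std=0.5, teacher_spread=0.5, seed=0):
    """Run one continual ridge pass; return mean squared distance to all teachers."""
    rng = np.random.default_rng(seed)
    mean_teacher = rng.standard_normal(dim) / np.sqrt(dim)
    w = np.zeros(dim)
    teachers = np.empty((num_tasks, dim))
    for t in range(num_tasks):
        # Assumed i.i.d. teachers: a shared mean plus independent perturbations.
        teachers[t] = mean_teacher + teacher_spread * rng.standard_normal(dim) / np.sqrt(dim)
        X = rng.standard_normal((samples_per_task, dim))  # fewer samples than dimensions
        y = X @ teachers[t] + noise_std * rng.standard_normal(samples_per_task)
        # Ridge step toward the previous iterate; no past data is stored.
        gram = X.T @ X + lam * dim * np.eye(dim)
        w = np.linalg.solve(gram, X.T @ y + lam * dim * w)
    return float(np.mean(np.sum((teachers - w) ** 2, axis=1)))


def empirical_optimum(num_tasks, lam_grid, num_trials=3, **sim_kwargs):
    """Return the grid value with the lowest trial-averaged loss, plus all mean losses."""
    mean_losses = np.array([
        np.mean([continual_ridge_loss(num_tasks, lam, seed=s, **sim_kwargs)
                 for s in range(num_trials)])
        for lam in lam_grid
    ])
    return float(lam_grid[int(np.argmin(mean_losses))]), mean_losses


if __name__ == "__main__":
    lam_grid = np.logspace(-2, 2, 13)
    for num_tasks in (4, 8, 16, 32):
        best_lam, _ = empirical_optimum(num_tasks, lam_grid)
        # Compare the empirical optimum against the T / ln T reference curve.
        print(num_tasks, best_lam, num_tasks / np.log(num_tasks))
```

At these small sizes the comparison is only indicative. A meaningful check would need larger dimensions, more trials, a finer grid, and a fit of the empirical optimum against T / ln T over a wide range of T.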
Original abstract
We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks $T$, specifically as $T/\ln T$. To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies generalization in overparameterized continual linear regression with isotropic L2 regularization applied across a sequence of tasks. It derives a closed-form expression for expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. The work shows that this regularization mitigates label noise for both single-teacher and multiple i.i.d. teacher settings, proves that the optimal fixed regularization strength scales as T/ln T, and validates the results via experiments on linear regression and neural networks, offering a practical recipe for continual learning systems.
Significance. If the closed-form derivation and T/ln T scaling hold, the paper provides the first explicit theoretical result on optimal regularization scaling in continual learning. The closed-form expression for arbitrary teachers and the experimental bridge to neural networks would be valuable for understanding bias-variance tradeoffs in sequential high-dimensional settings.
major comments (1)
- [Abstract and scaling law section] Abstract and scaling derivation: the closed-form expected loss is stated to hold for arbitrary linear teachers, but the proof that optimal fixed lambda scales as T/ln T is presented immediately after referencing the multiple i.i.d. teacher setting. For heterogeneous teachers (varying norms or covariances), the accumulated noise term does not factor uniformly, so the argmin over lambda can deviate from T/ln T; this assumption must be clarified or the scaling extended to support the central claim.
minor comments (1)
- [Experiments] Experiments section: the neural network validation should explicitly state the architecture, task sequence construction, and how the linear insights transfer to confirm the practical recipe.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the major comment below and will incorporate the necessary clarification in the revised manuscript.
Point-by-point responses
- Referee: [Abstract and scaling law section] Abstract and scaling derivation: the closed-form expected loss is stated to hold for arbitrary linear teachers, but the proof that optimal fixed lambda scales as T/ln T is presented immediately after referencing the multiple i.i.d. teacher setting. For heterogeneous teachers (varying norms or covariances), the accumulated noise term does not factor uniformly, so the argmin over lambda can deviate from T/ln T; this assumption must be clarified or the scaling extended to support the central claim.
  Authors: We agree that the T/ln T scaling derivation relies on the i.i.d. teacher assumption, where the accumulated noise term factors uniformly. The closed-form expected generalization loss itself is derived for arbitrary linear teachers and does not require this assumption. We will revise the abstract and the scaling-law section to explicitly state that the optimal fixed regularization scaling of T/ln T holds under the multiple i.i.d. teacher setting, while for heterogeneous teachers (varying norms or covariances) the optimal lambda may deviate and depend on the specific teacher parameters. This clarification preserves the generality of the closed-form result while accurately delimiting the scope of the scaling law.
  Revision: yes
Circularity Check
Derivation remains self-contained; scaling obtained from first-principles analysis of closed-form loss
full rationale
The paper derives a closed-form expected generalization loss for arbitrary linear teachers in the high-dimensional regime, then obtains the T/ln T scaling for optimal lambda by direct minimization of that expression. No step reduces a prediction to a fitted parameter defined on the same data, nor does any load-bearing claim rest solely on a self-citation whose content is unverified. The derivation is therefore independent of the target result and receives only a minor self-citation penalty.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-dimensional regime limit for linear regression with arbitrary teachers