
arxiv: 2601.13844 · v2 · submitted 2026-01-20 · 💻 cs.LG

Optimal L2 Regularization in High-dimensional Continual Linear Regression

Pith reviewed 2026-05-16 12:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learning · linear regression · L2 regularization · high-dimensional asymptotics · generalization error · scaling laws · overparameterized models · task sequences

The pith

The optimal fixed L2 regularization strength in high-dimensional continual linear regression scales as T / ln T, where T is the number of tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines generalization when an overparameterized linear model learns a sequence of regression tasks one by one using a single fixed isotropic L2 regularizer. It derives an exact closed-form expression for the expected test error that holds in the high-dimensional limit for any sequence of linear target functions. The analysis shows that this regularization reduces the effect of label noise whether the targets come from one teacher or from multiple independent ones. The central result is that the regularization coefficient minimizing the error grows nearly linearly with task count T, specifically in proportion to T divided by the natural logarithm of T. Experiments on both linear models and neural networks confirm the predicted scaling and its impact on performance.
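The setting described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: it assumes each task is solved by ridge regression whose penalty pulls the new weights toward the previous iterate with strength `lam * d`, and the names `continual_ridge_step` and `run_sequence`, the Gaussian features, and all sizes are hypothetical.

```python
import numpy as np

def continual_ridge_step(w_prev, X, y, lam):
    """One task update (assumed form):
    argmin_w ||X w - y||^2 + lam * d * ||w - w_prev||^2."""
    d = X.shape[1]
    A = X.T @ X + lam * d * np.eye(d)
    b = X.T @ y + lam * d * w_prev
    return np.linalg.solve(A, b)

def run_sequence(teachers, n, lam, noise_std, rng):
    """Train on one noisy task per teacher and return the final weights."""
    d = teachers.shape[1]
    w = np.zeros(d)
    for w_star in teachers:
        X = rng.standard_normal((n, d))                      # i.i.d. features, unit variance
        y = X @ w_star + noise_std * rng.standard_normal(n)  # noisy labels
        w = continual_ridge_step(w, X, y, lam)
    return w

rng = np.random.default_rng(0)
d, n, T = 200, 50, 10                    # overparameterized: d > n
w_star = rng.standard_normal(d) / np.sqrt(d)
teachers = np.tile(w_star, (T, 1))       # single-teacher setting
w_T = run_sequence(teachers, n, lam=1.0, noise_std=1.0, rng=rng)
# Average squared distance of the final weights to every teacher.
gen_loss = np.mean(np.sum((w_T - teachers) ** 2, axis=1))
```

Each step needs only the previous weights and the current task's data, which is consistent with the summary's point that no past data is stored.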

Core claim

In the high-dimensional regime, the expected generalization loss of continual linear regression under fixed isotropic L2 regularization admits a closed-form expression valid for arbitrary linear teachers. Minimizing this loss with respect to the regularization coefficient yields an optimal strength that scales asymptotically as T / ln T, where T denotes the number of tasks. The same regularizer mitigates label noise in both single-teacher and multiple i.i.d. teacher settings without requiring storage of past data.

What carries the argument

The closed-form expression for expected generalization loss obtained under high-dimensional asymptotics, which is then minimized to obtain the optimal regularization coefficient.
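As a worked sketch of the object being minimized, the block below writes out one plausible form of the per-task update and the error recursion it implies. The update rule and the symbols $X_t$, $z_t$, $P_t$ are assumptions made for illustration (labels $y_t = X_t w_t^\star + z_t$, penalty strength $\lambda d$ toward the previous iterate); only the definition of the expected generalization loss in the last line is taken from the paper's extracted text.

```latex
% Assumed update for task t:
w_t = \arg\min_{w} \; \|X_t w - y_t\|^2 + \lambda d \, \|w - w_{t-1}\|^2

% Its normal equations give the error recursion:
w_t - w_t^\star = P_t \, (w_{t-1} - w_t^\star) + \frac{1}{\lambda d} \, P_t X_t^\top z_t,
\qquad
P_t = \Bigl( I + \tfrac{1}{\lambda d} X_t^\top X_t \Bigr)^{-1}

% Expected generalization loss after T tasks, minimized over \lambda:
\mathbb{E}[G_T] = \frac{1}{T} \sum_{i=1}^{T} \mathbb{E} \, \| w_T - w_i^\star \|^2
```

Under this reading, unrolling the recursion expresses $w_T$ through products of the matrices $P_t$, so the loss depends on $\lambda$ only through moments of those products.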

If this is right

  • Isotropic L2 regularization alone suffices to control label noise across multiple teachers without memory of past tasks.
  • The T / ln T rule supplies a concrete schedule for choosing regularization strength as new tasks arrive.
  • The derived scaling governs generalization behavior in both exact linear regression and in trained neural networks.
  • The result applies to arbitrary linear teachers rather than restricted families of targets.
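The T / ln T rule in the second bullet can be sketched as a one-line function. This is a minimal illustration, not the paper's recipe: the name `lambda_schedule`, the constant `c`, and the handling of T = 1 are assumptions, and the scaling is asymptotic, so small T need not follow it.

```python
import math

def lambda_schedule(T, c=1.0):
    """Regularization strength proportional to T / ln T.

    `c` is a hypothetical problem-dependent constant that the summary
    does not specify. T = 1 is special-cased (arbitrarily, to return c)
    because ln 1 = 0.
    """
    if T < 1:
        raise ValueError("T must be a positive integer")
    if T == 1:
        return c
    return c * T / math.log(T)

for T in (2, 10, 100, 1000):
    print(T, lambda_schedule(T))
```

The result concerns a fixed strength chosen for a known horizon T, so in practice the function would be evaluated once at the planned number of tasks.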

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar scaling might appear in non-linear continual settings if high-dimensional approximations remain valid.
  • Practitioners could adjust the regularization coefficient upward with each new task according to the observed count.
  • The scaling offers a possible lens for analyzing forgetting in sequential training through effective regularization strength.
  • The approach may connect to other sequential estimation problems where ridge penalties accumulate over data streams.

Load-bearing premise

The closed-form expression (stated for arbitrary linear teachers) and the scaling law derived from it are exact only under the assumptions of the high-dimensional regime.

What would settle it

A high-dimensional simulation with increasing task count T in which the empirically optimal regularization strength deviates from proportionality to T / ln T would falsify the scaling claim.
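A test of this kind could be sketched as follows. Everything here is an assumption made for illustration: the update rule (ridge penalty `lam * d` toward the previous iterate), the i.i.d. teacher model (a shared mean plus small Gaussian perturbations), the helper names, and the sizes, which are far too small to represent the asymptotic regime. The script only reports the empirical optimum next to T / ln T; it does not assert that they agree.

```python
import numpy as np

def final_loss(T, lam, d=100, n=50, noise_std=1.0, teacher_std=0.1, seed=0):
    """Average ||w_T - w_i*||^2 over teachers after T noisy tasks."""
    rng = np.random.default_rng(seed)
    w_mean = rng.standard_normal(d) / np.sqrt(d)
    # Hypothetical i.i.d. teachers: shared mean plus independent perturbations.
    teachers = w_mean + teacher_std * rng.standard_normal((T, d)) / np.sqrt(d)
    w = np.zeros(d)
    for w_star in teachers:
        X = rng.standard_normal((n, d))
        y = X @ w_star + noise_std * rng.standard_normal(n)
        # Assumed update: ridge penalty lam * d on the distance to the previous iterate.
        w = np.linalg.solve(X.T @ X + lam * d * np.eye(d), X.T @ y + lam * d * w)
    return float(np.mean(np.sum((w - teachers) ** 2, axis=1)))

def empirical_optimum(T, lam_grid, seeds=range(3)):
    """Grid value of lambda with the lowest seed-averaged loss."""
    losses = [np.mean([final_loss(T, lam, seed=s) for s in seeds]) for lam in lam_grid]
    return lam_grid[int(np.argmin(losses))]

lam_grid = np.logspace(-2, 2, 9)
for T in (4, 16, 64):
    lam_hat = empirical_optimum(T, lam_grid)
    print(f"T={T:3d}  empirical lambda*={lam_hat:.3g}  T/ln T={T / np.log(T):.3g}")
```

A serious check would use much larger d and n at a fixed aspect ratio, a finer grid, and more seeds, and would compare the ratio of the empirical optimum to T / ln T across horizons.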

Figures

Figures reproduced from arXiv: 2601.13844 by Daniel Soudry, Edward Moroshko, Gilad Karpel, Itay Evron, Ran Levinstein, Ron Meir.

Figure 1
Figure 1: Regularization effects on synthetic data. Interactions between number of tasks (T), regularization strength (λ), and resulting generalization loss (normalized by a trivial regressor w = 0). Here, the label noise and feature variances are vz = vx = 1; each curve averages over 5 runs. While the unregularized scheme collapses under label noise, optimal regularization endures. The unregularized scheme solved …
Figure 2
Figure 2: Optimal regularization substantially im…
Figure 3
Figure 3: Regularization in linear models and neural networks (NN). We plot the held-out classification error after learning T noisy tasks. Curves are averaged over 5 seeds; stars mark minima.
5.3. Neural Network Experiments: Validation Beyond Linear Models. We also consider a two-layer ReLU neural network (784→500→1) with a sigmoid output, trained continually with L2 regularization. Again, noisy binary labels are g…
Figure 4
Figure 4: Regularization strength vs. task horizon. Stars mark the empirical optimal regularization strength obtained after training on a noisy sequence of T tasks and averaging over 5 random seeds. We also plot the analytical optima predicted by our Theorem 6, shown as a white curve. We observe a seemingly multiplicative mismatch between the empirical and analytical optima.
Footnote 8: By normalization we mean per-feature…
Figure 5
Figure 5: Empirical optimal regularization strength vs. task horizon. We plot the empirical optimal strength after learning a noisy sequence with T tasks, averaged over 5 seeds, and compare it to the optimum predicted by our analysis.
B.3. Aspect-Ratio Scaling Predicted by Theory Extends Beyond i.i.d. Features. We evaluate a linear model trained on the original non-whitened MNIST-based data, but vary the aspect ratio n…
Figure 6
Figure 6: Effect of aspect ratio on optimal regularization strength. We plot the empirical optimal regularizer after learning a noisy sequence with T tasks. Each curve reports the average empirical and theoretical values over 5 random seeds.
Original abstract

We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks $T$, specifically as $T/\ln T$. To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper studies generalization in overparameterized continual linear regression with isotropic L2 regularization applied across a sequence of tasks. It derives a closed-form expression for expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. The work shows that this regularization mitigates label noise for both single-teacher and multiple i.i.d. teacher settings, proves that the optimal fixed regularization strength scales as T/ln T, and validates the results via experiments on linear regression and neural networks, offering a practical recipe for continual learning systems.

Significance. If the closed-form derivation and T/ln T scaling hold, the paper provides the first explicit theoretical result on optimal regularization scaling in continual learning. The closed-form for arbitrary teachers and the experimental bridge to neural networks would be valuable for understanding bias-variance tradeoffs in sequential high-dimensional settings.

major comments (1)
  1. [Abstract and scaling law section] Abstract and scaling derivation: the closed-form expected loss is stated to hold for arbitrary linear teachers, but the proof that optimal fixed lambda scales as T/ln T is presented immediately after referencing the multiple i.i.d. teacher setting. For heterogeneous teachers (varying norms or covariances), the accumulated noise term does not factor uniformly, so the argmin over lambda can deviate from T/ln T; this assumption must be clarified or the scaling extended to support the central claim.
minor comments (1)
  1. [Experiments] Experiments section: the neural network validation should explicitly state the architecture, task sequence construction, and how the linear insights transfer to confirm the practical recipe.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comment below and will incorporate the necessary clarification in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract and scaling law section] Abstract and scaling derivation: the closed-form expected loss is stated to hold for arbitrary linear teachers, but the proof that optimal fixed lambda scales as T/ln T is presented immediately after referencing the multiple i.i.d. teacher setting. For heterogeneous teachers (varying norms or covariances), the accumulated noise term does not factor uniformly, so the argmin over lambda can deviate from T/ln T; this assumption must be clarified or the scaling extended to support the central claim.

    Authors: We agree that the T/ln T scaling derivation relies on the i.i.d. teacher assumption, where the accumulated noise term factors uniformly. The closed-form expected generalization loss itself is derived for arbitrary linear teachers and does not require this assumption. We will revise the abstract and the scaling-law section to explicitly state that the optimal fixed regularization scaling of T/ln T holds under the multiple i.i.d. teacher setting, while for heterogeneous teachers (varying norms or covariances) the optimal lambda may deviate and depend on the specific teacher parameters. This clarification preserves the generality of the closed-form result while accurately delimiting the scope of the scaling law.

    revision: yes

Circularity Check

0 steps flagged

Derivation remains self-contained; scaling obtained from first-principles analysis of closed-form loss

Full rationale

The paper derives a closed-form expected generalization loss for arbitrary linear teachers in the high-dimensional regime, then obtains the T/ln T scaling for optimal lambda by direct minimization of that expression. No step reduces a prediction to a fitted parameter defined on the same data, nor does any load-bearing claim rest solely on a self-citation whose content is unverified. The derivation is therefore independent of the target result and receives only a minor self-citation penalty.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard high-dimensional asymptotic assumptions for linear regression and the continual training protocol with fixed isotropic regularization; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: high-dimensional regime limit for linear regression with arbitrary teachers.
    Invoked to obtain the closed-form expected generalization loss.

pith-pipeline@v0.9.0 · 5462 in / 1202 out tokens · 26371 ms · 2026-05-16T12:51:57.944658+00:00 · methodology

