PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks

Atilla Eryilmaz; Jia (Kevin) Liu; Srijith Nair

arxiv: 2605.19145 · v1 · pith:VN46LWVBnew · submitted 2026-05-18 · 💻 cs.LG

PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks

Srijith Nair , Atilla Eryilmaz , Jia (Kevin) Liu This is my paper

Pith reviewed 2026-05-20 11:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords continual learningcatastrophic forgettingPareto optimalitymulti-task learningconflicting tasksregressionminimal forgettingquadratic bounds

0 comments

The pith

Finding Pareto-optimal solutions among task objectives yields minimal forgetting in continual learning of conflicting tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When new tasks conflict with previously learned ones, no single model can excel at every task at once. The paper shows that treating each task's loss as a separate objective and selecting a Pareto-optimal point among them produces a solution that cannot be improved on any past task without harming another. This replaces the unrealistic hope of a shared global minimizer with a concrete trade-off that, by definition, forgets prior tasks as little as possible in the vector sense. The authors then derive concrete update rules for linear regression, basis-function regression, and losses that admit quadratic upper bounds, all while keeping memory use fixed at quadratic in the model dimension. A reader would care because the method supplies a systematic, memory-bounded recipe for handling inevitable conflicts rather than ad-hoc replay or penalties.

Core claim

The central claim is that continual learning of conflicting tasks reduces to locating Pareto-optimal points in the multi-objective space whose coordinates are the individual task losses. Such points are minimal-forgetting by construction: no other parameter vector can lower every task loss simultaneously. For quadratic problems the authors give iterative updates that achieve this property with static memory cost O(d squared) for d-dimensional models; the same construction extends to any loss possessing a quadratic upper bound.

What carries the argument

Pareto optimality over the vector of task-specific losses, which selects parameter vectors that cannot be dominated on all tasks at once and thereby encodes the minimal-forgetting compromise.

If this is right

Explicit iterative algorithms exist for linear and basis-function regression that achieve the Pareto-minimal-forgetting property.
The same construction applies to any loss admitting a quadratic upper bound, including logistic regression.
Memory footprint remains O(d squared) and independent of the number of tasks for all quadratic cases.
The framework supplies a principled replacement for methods that assume a shared global optimum across tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Pareto construction could be tested on non-convex models by replacing exact Pareto search with approximate multi-objective gradient methods.
Choice of which Pareto point to pick along the front may affect long-horizon performance when many tasks arrive in sequence.
The approach suggests examining whether selective storage of only the most conflicting task gradients can further reduce the memory cost while preserving the Pareto guarantee.

Load-bearing premise

That a Pareto-optimal point chosen with respect to the current set of task losses directly yields the smallest possible forgetting in the sequential continual-learning sense when the tasks have no common global minimizer.

What would settle it

An explicit counter-example in which a non-Pareto parameter vector achieves strictly lower loss on every previous task than the Pareto solution selected by the algorithm would falsify the minimal-forgetting claim.

Figures

Figures reproduced from arXiv: 2605.19145 by Atilla Eryilmaz, Jia (Kevin) Liu, Srijith Nair.

**Figure 2.** Figure 2: Illustration of the Pareto Fronts (PF) and a 1-D visualization in parameter domain for a pair [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Performance metrics across 10 sequential tasks. (Red = PMF-CL, Blue = SGD, Green = [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗

**Figure 4.** Figure 4: Computational trade-offs showing (a) total memory footprint and (b) execution time per [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

read the original abstract

In the literature, many continual learning (CL) algorithms have been proposed to address the issue of catastrophic forgetting in ML models (i.e., learning new tasks leads to the loss of performance on previously learned tasks). Although all CL approaches use some form of memory to retain information about past tasks, a grounded understanding of what information needs to be stored to minimize catastrophic forgetting remains elusive. Recently, it has been recognized that under the strong assumption of the existence of a common global minimizer over all tasks, catastrophic forgetting can be completely avoided. However, in practice, tasks rarely have a common global minimizer, and a certain amount of forgetting is inevitable. In this paper, we propose a foundational framework for principled and systematic CL of conflicting tasks using a multi-task learning (MTL) perspective. The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense. We derive Pareto-minimal-forgetting CL algorithms for linear and basis-function regression, and general loss functions which have a quadratic upper bound, e.g., logistic regression. For quadratic problems, PMF-CL uses memory-efficient iterative updates with a static memory footage of $\mathcal{O}(d^2)$ for models with $d$ parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PMF-CL frames continual learning as Pareto optimization over task losses and derives O(d^2) memory updates for linear and quadratically bounded cases, but the sequential updates may not reach the true joint Pareto front when tasks conflict and old data is unavailable.

read the letter

The main takeaway is that this paper connects continual learning to multi-task Pareto optimality, arguing that solutions on the Pareto front of the vector of task losses are the ones that minimally forget prior tasks by definition. They work out explicit algorithms for linear regression and losses with quadratic upper bounds, using iterative updates that hold memory fixed at O(d^2) for d-dimensional models. That framing and the memory claim are the concrete pieces that stand out from the abstract and the stress-test note. The work does a decent job of moving away from purely heuristic replay or regularization toward a multi-objective view that might organize how we think about conflicting objectives in regression-style continual settings. The static memory footprint is a practical detail worth testing. The soft spot is the one the stress-test flags: when tasks lack a common minimizer, any finite summary of past losses is incomplete, so the update after seeing only the new task can land at a point dominated by some other feasible solution that would have been available with full data. The paper needs to show either that the approximation still converges to the exact front or that the extra forgetting stays bounded in the metrics they care about. Without those steps the central claim risks being more definitional than operational. This is for people working on theoretical continual learning or on memory-efficient methods for regression and convex-ish losses. A reader who wants a fresh reduction of the forgetting problem could get something useful out of the derivations and updates. It shows enough structure and engagement with the literature to deserve a serious referee, even if the proofs and experiments will need work.

Referee Report

2 major / 2 minor

Summary. The paper proposes PMF-CL, a continual learning framework that recasts catastrophic forgetting as a multi-task optimization problem and selects Pareto-optimal parameter vectors with respect to the vector of task losses. It claims to derive exact or approximate update rules for linear regression, basis-function regression, and losses admitting quadratic upper bounds (e.g., logistic regression), together with an O(d²) static-memory iterative procedure that maintains the Pareto property without revisiting past data.

Significance. If the derivations are correct and the memory-constrained iterates remain on (or sufficiently close to) the true Pareto front, the work supplies a principled, non-heuristic link between continual learning and multi-task Pareto optimality. The explicit O(d²) memory bound and the treatment of conflicting tasks without a common minimizer are concrete strengths that could inform future algorithm design.

major comments (2)

[§3.2] §3.2 (linear-regression case): the update rule is derived from the Pareto-stationarity condition on the joint loss vector, yet the manuscript does not prove that the finite-memory summary statistics suffice to reach a point on the true (non-approximated) Pareto front when the tasks have no common minimizer; the sub-optimality gap with respect to the forgetting metric used in the experiments is therefore left unquantified.
[§4] §4 (general quadratic-upper-bound losses): the local quadratic approximation or dual-variable update is stated to preserve Pareto minimality, but the argument appears to rely on the same summary statistics as the linear case; a concrete counter-example or bound showing when the approximation error causes the iterate to be dominated by a feasible point attainable only with full data would strengthen the central claim.

minor comments (2)

Notation for the Pareto front and the forgetting measure should be introduced once and used consistently; several paragraphs switch between “Pareto-minimal forgetting” and “minimal forgetting in the Pareto sense” without explicit cross-reference.
The experimental section would benefit from an ablation that isolates the effect of the memory footprint versus the choice of Pareto solver; current tables report aggregate accuracy but do not separate these factors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The observations help us strengthen the theoretical grounding of the PMF-CL framework. We respond to each major comment below and indicate the revisions we will incorporate.

read point-by-point responses

Referee: [§3.2] §3.2 (linear-regression case): the update rule is derived from the Pareto-stationarity condition on the joint loss vector, yet the manuscript does not prove that the finite-memory summary statistics suffice to reach a point on the true (non-approximated) Pareto front when the tasks have no common minimizer; the sub-optimality gap with respect to the forgetting metric used in the experiments is therefore left unquantified.

Authors: For the linear-regression case the per-task losses are exactly quadratic, so the maintained summary statistics (Gram matrix of features and cross terms with targets) permit exact evaluation of the Pareto-stationarity condition without revisiting past data. We acknowledge that the current manuscript does not contain an explicit theorem proving that the resulting memory-efficient iterates lie on the true Pareto front rather than an approximation thereof. We will add such a theorem together with its proof in the revised manuscript. We will also report a quantitative bound (or empirical gap) on the forgetting metric relative to an oracle that retains all data. revision: yes
Referee: [§4] §4 (general quadratic-upper-bound losses): the local quadratic approximation or dual-variable update is stated to preserve Pareto minimality, but the argument appears to rely on the same summary statistics as the linear case; a concrete counter-example or bound showing when the approximation error causes the iterate to be dominated by a feasible point attainable only with full data would strengthen the central claim.

Authors: In §4 the quadratic upper bound is used to obtain an approximate Pareto update that re-uses the same O(d²) summary statistics. We agree that the manuscript would benefit from a more precise characterization of the approximation error. In the revision we will supply (i) an explicit error bound expressed in terms of the tightness of the quadratic upper bound and (ii) a concrete counter-example demonstrating a case in which the approximate iterate is dominated by a point reachable only with full data, together with sufficient conditions under which the error remains controlled. revision: yes

Circularity Check

1 steps flagged

Pareto optimality equated to minimal forgetting by definition without independent derivation

specific steps

self definitional [Abstract]
"The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense."

The text explicitly equates the Pareto-optimal set with minimal forgetting via the phrase 'by definition,' rendering the claimed 'principled' minimal-forgetting property tautological with the multi-task Pareto framing rather than independently derived from the continual-learning update rules or memory constraints.

full rationale

The paper's foundational claim defines Pareto-optimal solutions as those that 'by definition, minimally forget the previous tasks in the Pareto sense.' This self-definitional step makes the central result a restatement of the chosen multi-objective framing rather than a derived property of the sequential updates. The O(d²) memory iterative update for quadratic losses is presented as maintaining the Pareto property, yet relies on local approximations that do not provably recover the true joint Pareto front when tasks lack a common minimizer. No external benchmark or non-circular proof is supplied to separate the definition from the forgetting metric, yielding moderate circularity confined to the core premise.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that tasks lack a common global minimizer and on the standard mathematical notion of Pareto optimality; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Tasks rarely have a common global minimizer
Explicitly stated as the practical regime where forgetting is inevitable.
standard math Pareto-optimal solutions exist for the multi-task objectives
Invoked to define minimal forgetting in the Pareto sense.

pith-pipeline@v0.9.0 · 5766 in / 1256 out tokens · 49719 ms · 2026-05-20T11:49:42.971003+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a foundational framework for principled and systematic CL of conflicting tasks using a multi-task learning (MTL) perspective. The approach is based on finding Pareto-optimal solutions... For quadratic problems, PMF-CL uses memory-efficient iterative updates with a static memory footage of O(d²)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 5.2 (Iterative PMF-CL for QUB Loss)... A_k Δθ_k = α_k H_k (θ_min_k − θ⋆_{k−1})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 5 internal anchors

[1]

N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V

Bell, J., Quarantiello, L., Coleman, E. N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V . The future of continual learning in the era of foundation models: Three key directions.arXiv preprint arXiv:2506.03320,

work page arXiv
[2]

Unifying regularisation methods for continual learning.arXiv preprint arXiv:2006.06357,

Benzing, F. Unifying regularisation methods for continual learning.arXiv preprint arXiv:2006.06357,

work page arXiv 2006
[3]

On Tiny Episodic Memories in Continual Learning

Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with a-GEM. International Conference on Learning Representations, 2019a. URL https://openreview. net/forum?id=Hkf2_sC5FX. Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H., and Ranzato, M. On tiny episodic memories in continual learnin...

work page internal anchor Pith review Pith/arXiv arXiv 1902
[4]

doi: https://doi.org/10.1016/j.inffus.2022.07.024

ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2022.07.024. URL https://www. sciencedirect.com/science/article/pii/S1566253522000884. Désidéri, J.-A. Multiple-gradient descent algorithm (mgda) for multiobjective optimization.Comptes Rendus Mathematique, 350(5-6):313–318,

work page doi:10.1016/j.inffus.2022.07.024 2022
[5]

From continual learning to sgd and back: Better rates for continual linear models.arXiv preprint arXiv:2504.04579,

Evron, I., Levinstein, R., Schliserman, M., Sherman, U., Koren, T., Soudry, D., and Srebro, N. From continual learning to sgd and back: Better rates for continual linear models.arXiv preprint arXiv:2504.04579,

work page arXiv
[6]

Federated learning meets multi-objective optimization

Hu, Z., Shaloudegi, K., Zhang, G., and Yu, Y . Federated learning meets multi-objective optimization. IEEE Transactions on Network Science and Engineering, 9(4):2039–2051,

work page 2039
[7]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[8]

K., Shah, N., and Prasad, R

Keshri, S. K., Shah, N., and Prasad, R. On the convergence of continual federated learning using incrementally aggregated gradients.arXiv preprint arXiv:2411.07959,

work page arXiv
[9]

doi: 10.1073/pnas.1611835114

doi: 10.1073/pnas.1611835114. URL https: //www.pnas.org/doi/abs/10.1073/pnas.1611835114. Knoblauch, J., Husain, H., and Diethe, T. Optimal continual learning has perfect memory and is np-hard. InInternational Conference on Machine Learning, pp. 5327–5337. PMLR,

work page doi:10.1073/pnas.1611835114
[10]

URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ

ISBN 9780943494029. URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ. Levinstein, R., Attia, A., Schliserman, M., Sherman, U., Koren, T., Soudry, D., and Evron, I. Optimal rates in continual linear regression via increasing regularization.arXiv preprint arXiv:2506.06501,

work page arXiv
[11]

Liu, H., Long, M., Wang, J., and Jordan, M. I. Towards understanding the transferability of deep representations.arXiv preprint arXiv:1909.12031,

work page arXiv 1909
[12]

Communication-Efficient Learning of Deep Networks from Decentralized Data,

11 McMahan, H. B., Moore, E., Ramage, D., and y Arcas, B. A. Federated learning of deep networks using model averaging.CoRR, abs/1602.05629,

work page arXiv
[13]

doi: 10.1007/978-3-540-88908-3_1

ISBN 978-3-540-88908-3. doi: 10.1007/978-3-540-88908-3_1. URL https://doi.org/10.1007/ 978-3-540-88908-3_1. Motulsky, H. and Christopoulos, A.Fitting Models to Biological Data Using Linear and Non- linear Regression: A Practical Guide to Curve Fitting. Oxford University Press,

work page doi:10.1007/978-3-540-88908-3_1
[14]

Learning Transferable Visual Models From Natural Language Supervision

URLhttps://arxiv.org/abs/2103.00020. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

URLhttps://arxiv.org/abs/2204.06125. Ray, P. P. Timely need for navigating the potential and downsides of llms in healthcare and biomedicine.Briefings in bioinformatics, 25(3):bbae214,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

doi: https://doi.org/10.1016/0921-8890(95)00004-Y

ISSN 0921-8890. doi: https://doi.org/10.1016/0921-8890(95)00004-Y. URL https: //www.sciencedirect.com/science/article/pii/092188909500004Y. The Biology and Technology of Intelligent Autonomous Agents. 12 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A....

work page doi:10.1016/0921-8890(95)00004-y
[17]

LLaMA: Open and Efficient Foundation Language Models

URLhttps://arxiv.org/abs/2302.13971. Usmanova, A., Portet, F., Lalanda, P., and Vega, G. A distillation-based approach integrating continual learning and federated learning for pervasive services.arXiv preprint arXiv:2109.04197,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Fed- erated continual learning for edge-ai: A comprehensive survey,

Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application.IEEE transactions on pattern analysis and machine intelligence, 46(8): 5362–5383, 2024a. Wang, Q., Liu, B., and Li, Y . Traceable federated continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni...

work page arXiv
[19]

It is deeply rooted in cognitive science, as the ability to consolidate new information without overwriting existing knowledge is a hallmark of human intelligence (French, 1999)

13 A Extended Literature Review Foundations of Continual Learning.The field of CL, or lifelong learning, emerged from early research in robotic agents and task rehearsal (Thrun & Mitchell, 1995; Thrun, 1995; Silver & Mercer, 2002). It is deeply rooted in cognitive science, as the ability to consolidate new information without overwriting existing knowledg...

work page 1995
[20]

Thus, the MSI for linear regression is Ilin t := (V e t ,Σ e t ,ey), with memorysize(I lin t ) =r t(d+ 2)

B.1.2 MSI for Linear Regression (Lemma 4.1) Lemma B.2(MSI for Linear Regression).Using the eSVD of X t, we can write the loss function for linear regression as: Lt(θ) = 1 2nt V e t ⊤θ− eyt 2 Σe t + min θ′∈W Lt(θ′)(20) where we use the notation ∥z∥2 A :=∥Az∥ 2 2,ey=Σ e t −1U e t yt. Thus, the MSI for linear regression is Ilin t := (V e t ,Σ e t ,ey), with ...

work page 2001
[21]

Lemma E.1(Böhning (1992)).For a probability vector p∈∆ K−1, the matrix V(p) := Diag(p)− pp⊤ is upper bounded by: V:= 1 2 I K − 1 K 11⊤ ,(63) where1is the all-ones vector of sizeK

that will be useful to derive the Hessian upper bound for multi-class logistic regression. Lemma E.1(Böhning (1992)).For a probability vector p∈∆ K−1, the matrix V(p) := Diag(p)− pp⊤ is upper bounded by: V:= 1 2 I K − 1 K 11⊤ ,(63) where1is the all-ones vector of sizeK. In other words V−V(p)is positive semi-definite. Substituting the boundV(p)⪯ Vinto (62)...

work page 1992
[22]

7:UpdateN k,i =N k−1,i +n k,i

4:fork= 1toTdo 5:foreach clienti= 1tomin parallel do 6:Computeθ min k,i andH k,i based onD k,i. 7:UpdateN k,i =N k−1,i +n k,i. 8:UpdateB k,i = Nk−1,i Nk,i Bk−1,i +α (k) k,i H t,i. 9:Computeγ k,i =α (k) k,i H k,i θmin k,i −θ ⋆ k−1 . 10:SendN k,i,B k,i andγ k,i to the central server. 11:end for 12:At server,computepreferencesβ (k) i for alli∈[m]based on{N k...

work page 2021

[1] [1]

N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V

Bell, J., Quarantiello, L., Coleman, E. N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V . The future of continual learning in the era of foundation models: Three key directions.arXiv preprint arXiv:2506.03320,

work page arXiv

[2] [2]

Unifying regularisation methods for continual learning.arXiv preprint arXiv:2006.06357,

Benzing, F. Unifying regularisation methods for continual learning.arXiv preprint arXiv:2006.06357,

work page arXiv 2006

[3] [3]

On Tiny Episodic Memories in Continual Learning

Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with a-GEM. International Conference on Learning Representations, 2019a. URL https://openreview. net/forum?id=Hkf2_sC5FX. Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H., and Ranzato, M. On tiny episodic memories in continual learnin...

work page internal anchor Pith review Pith/arXiv arXiv 1902

[4] [4]

doi: https://doi.org/10.1016/j.inffus.2022.07.024

ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2022.07.024. URL https://www. sciencedirect.com/science/article/pii/S1566253522000884. Désidéri, J.-A. Multiple-gradient descent algorithm (mgda) for multiobjective optimization.Comptes Rendus Mathematique, 350(5-6):313–318,

work page doi:10.1016/j.inffus.2022.07.024 2022

[5] [5]

From continual learning to sgd and back: Better rates for continual linear models.arXiv preprint arXiv:2504.04579,

Evron, I., Levinstein, R., Schliserman, M., Sherman, U., Koren, T., Soudry, D., and Srebro, N. From continual learning to sgd and back: Better rates for continual linear models.arXiv preprint arXiv:2504.04579,

work page arXiv

[6] [6]

Federated learning meets multi-objective optimization

Hu, Z., Shaloudegi, K., Zhang, G., and Yu, Y . Federated learning meets multi-objective optimization. IEEE Transactions on Network Science and Engineering, 9(4):2039–2051,

work page 2039

[7] [7]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001

[8] [8]

K., Shah, N., and Prasad, R

Keshri, S. K., Shah, N., and Prasad, R. On the convergence of continual federated learning using incrementally aggregated gradients.arXiv preprint arXiv:2411.07959,

work page arXiv

[9] [9]

doi: 10.1073/pnas.1611835114

doi: 10.1073/pnas.1611835114. URL https: //www.pnas.org/doi/abs/10.1073/pnas.1611835114. Knoblauch, J., Husain, H., and Diethe, T. Optimal continual learning has perfect memory and is np-hard. InInternational Conference on Machine Learning, pp. 5327–5337. PMLR,

work page doi:10.1073/pnas.1611835114

[10] [10]

URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ

ISBN 9780943494029. URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ. Levinstein, R., Attia, A., Schliserman, M., Sherman, U., Koren, T., Soudry, D., and Evron, I. Optimal rates in continual linear regression via increasing regularization.arXiv preprint arXiv:2506.06501,

work page arXiv

[11] [11]

Liu, H., Long, M., Wang, J., and Jordan, M. I. Towards understanding the transferability of deep representations.arXiv preprint arXiv:1909.12031,

work page arXiv 1909

[12] [12]

Communication-Efficient Learning of Deep Networks from Decentralized Data,

11 McMahan, H. B., Moore, E., Ramage, D., and y Arcas, B. A. Federated learning of deep networks using model averaging.CoRR, abs/1602.05629,

work page arXiv

[13] [13]

doi: 10.1007/978-3-540-88908-3_1

ISBN 978-3-540-88908-3. doi: 10.1007/978-3-540-88908-3_1. URL https://doi.org/10.1007/ 978-3-540-88908-3_1. Motulsky, H. and Christopoulos, A.Fitting Models to Biological Data Using Linear and Non- linear Regression: A Practical Guide to Curve Fitting. Oxford University Press,

work page doi:10.1007/978-3-540-88908-3_1

[14] [14]

Learning Transferable Visual Models From Natural Language Supervision

URLhttps://arxiv.org/abs/2103.00020. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

URLhttps://arxiv.org/abs/2204.06125. Ray, P. P. Timely need for navigating the potential and downsides of llms in healthcare and biomedicine.Briefings in bioinformatics, 25(3):bbae214,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

doi: https://doi.org/10.1016/0921-8890(95)00004-Y

ISSN 0921-8890. doi: https://doi.org/10.1016/0921-8890(95)00004-Y. URL https: //www.sciencedirect.com/science/article/pii/092188909500004Y. The Biology and Technology of Intelligent Autonomous Agents. 12 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A....

work page doi:10.1016/0921-8890(95)00004-y

[17] [17]

LLaMA: Open and Efficient Foundation Language Models

URLhttps://arxiv.org/abs/2302.13971. Usmanova, A., Portet, F., Lalanda, P., and Vega, G. A distillation-based approach integrating continual learning and federated learning for pervasive services.arXiv preprint arXiv:2109.04197,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Fed- erated continual learning for edge-ai: A comprehensive survey,

Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application.IEEE transactions on pattern analysis and machine intelligence, 46(8): 5362–5383, 2024a. Wang, Q., Liu, B., and Li, Y . Traceable federated continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni...

work page arXiv

[19] [19]

It is deeply rooted in cognitive science, as the ability to consolidate new information without overwriting existing knowledge is a hallmark of human intelligence (French, 1999)

13 A Extended Literature Review Foundations of Continual Learning.The field of CL, or lifelong learning, emerged from early research in robotic agents and task rehearsal (Thrun & Mitchell, 1995; Thrun, 1995; Silver & Mercer, 2002). It is deeply rooted in cognitive science, as the ability to consolidate new information without overwriting existing knowledg...

work page 1995

[20] [20]

Thus, the MSI for linear regression is Ilin t := (V e t ,Σ e t ,ey), with memorysize(I lin t ) =r t(d+ 2)

B.1.2 MSI for Linear Regression (Lemma 4.1) Lemma B.2(MSI for Linear Regression).Using the eSVD of X t, we can write the loss function for linear regression as: Lt(θ) = 1 2nt V e t ⊤θ− eyt 2 Σe t + min θ′∈W Lt(θ′)(20) where we use the notation ∥z∥2 A :=∥Az∥ 2 2,ey=Σ e t −1U e t yt. Thus, the MSI for linear regression is Ilin t := (V e t ,Σ e t ,ey), with ...

work page 2001

[21] [21]

Lemma E.1(Böhning (1992)).For a probability vector p∈∆ K−1, the matrix V(p) := Diag(p)− pp⊤ is upper bounded by: V:= 1 2 I K − 1 K 11⊤ ,(63) where1is the all-ones vector of sizeK

that will be useful to derive the Hessian upper bound for multi-class logistic regression. Lemma E.1(Böhning (1992)).For a probability vector p∈∆ K−1, the matrix V(p) := Diag(p)− pp⊤ is upper bounded by: V:= 1 2 I K − 1 K 11⊤ ,(63) where1is the all-ones vector of sizeK. In other words V−V(p)is positive semi-definite. Substituting the boundV(p)⪯ Vinto (62)...

work page 1992

[22] [22]

7:UpdateN k,i =N k−1,i +n k,i

4:fork= 1toTdo 5:foreach clienti= 1tomin parallel do 6:Computeθ min k,i andH k,i based onD k,i. 7:UpdateN k,i =N k−1,i +n k,i. 8:UpdateB k,i = Nk−1,i Nk,i Bk−1,i +α (k) k,i H t,i. 9:Computeγ k,i =α (k) k,i H k,i θmin k,i −θ ⋆ k−1 . 10:SendN k,i,B k,i andγ k,i to the central server. 11:end for 12:At server,computepreferencesβ (k) i for alli∈[m]based on{N k...

work page 2021