pith. sign in

arxiv: 2605.19145 · v1 · pith:VN46LWVBnew · submitted 2026-05-18 · 💻 cs.LG

PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks

Pith reviewed 2026-05-20 11:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learningcatastrophic forgettingPareto optimalitymulti-task learningconflicting tasksregressionminimal forgettingquadratic bounds
0
0 comments X

The pith

Finding Pareto-optimal solutions among task objectives yields minimal forgetting in continual learning of conflicting tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When new tasks conflict with previously learned ones, no single model can excel at every task at once. The paper shows that treating each task's loss as a separate objective and selecting a Pareto-optimal point among them produces a solution that cannot be improved on any past task without harming another. This replaces the unrealistic hope of a shared global minimizer with a concrete trade-off that, by definition, forgets prior tasks as little as possible in the vector sense. The authors then derive concrete update rules for linear regression, basis-function regression, and losses that admit quadratic upper bounds, all while keeping memory use fixed at quadratic in the model dimension. A reader would care because the method supplies a systematic, memory-bounded recipe for handling inevitable conflicts rather than ad-hoc replay or penalties.

Core claim

The central claim is that continual learning of conflicting tasks reduces to locating Pareto-optimal points in the multi-objective space whose coordinates are the individual task losses. Such points are minimal-forgetting by construction: no other parameter vector can lower every task loss simultaneously. For quadratic problems the authors give iterative updates that achieve this property with static memory cost O(d squared) for d-dimensional models; the same construction extends to any loss possessing a quadratic upper bound.

What carries the argument

Pareto optimality over the vector of task-specific losses, which selects parameter vectors that cannot be dominated on all tasks at once and thereby encodes the minimal-forgetting compromise.

If this is right

  • Explicit iterative algorithms exist for linear and basis-function regression that achieve the Pareto-minimal-forgetting property.
  • The same construction applies to any loss admitting a quadratic upper bound, including logistic regression.
  • Memory footprint remains O(d squared) and independent of the number of tasks for all quadratic cases.
  • The framework supplies a principled replacement for methods that assume a shared global optimum across tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Pareto construction could be tested on non-convex models by replacing exact Pareto search with approximate multi-objective gradient methods.
  • Choice of which Pareto point to pick along the front may affect long-horizon performance when many tasks arrive in sequence.
  • The approach suggests examining whether selective storage of only the most conflicting task gradients can further reduce the memory cost while preserving the Pareto guarantee.

Load-bearing premise

That a Pareto-optimal point chosen with respect to the current set of task losses directly yields the smallest possible forgetting in the sequential continual-learning sense when the tasks have no common global minimizer.

What would settle it

An explicit counter-example in which a non-Pareto parameter vector achieves strictly lower loss on every previous task than the Pareto solution selected by the algorithm would falsify the minimal-forgetting claim.

Figures

Figures reproduced from arXiv: 2605.19145 by Atilla Eryilmaz, Jia (Kevin) Liu, Srijith Nair.

Figure 1
Figure 1. Figure 1: The task setup in ICL vs PMF-CL visualized in the task-loss domain (left) and parameter [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the Pareto Fronts (PF) and a 1-D visualization in parameter domain for a pair [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance metrics across 10 sequential tasks. (Red = PMF-CL, Blue = SGD, Green = [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Computational trade-offs showing (a) total memory footprint and (b) execution time per [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗
read the original abstract

In the literature, many continual learning (CL) algorithms have been proposed to address the issue of catastrophic forgetting in ML models (i.e., learning new tasks leads to the loss of performance on previously learned tasks). Although all CL approaches use some form of memory to retain information about past tasks, a grounded understanding of what information needs to be stored to minimize catastrophic forgetting remains elusive. Recently, it has been recognized that under the strong assumption of the existence of a common global minimizer over all tasks, catastrophic forgetting can be completely avoided. However, in practice, tasks rarely have a common global minimizer, and a certain amount of forgetting is inevitable. In this paper, we propose a foundational framework for principled and systematic CL of conflicting tasks using a multi-task learning (MTL) perspective. The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense. We derive Pareto-minimal-forgetting CL algorithms for linear and basis-function regression, and general loss functions which have a quadratic upper bound, e.g., logistic regression. For quadratic problems, PMF-CL uses memory-efficient iterative updates with a static memory footage of $\mathcal{O}(d^2)$ for models with $d$ parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PMF-CL, a continual learning framework that recasts catastrophic forgetting as a multi-task optimization problem and selects Pareto-optimal parameter vectors with respect to the vector of task losses. It claims to derive exact or approximate update rules for linear regression, basis-function regression, and losses admitting quadratic upper bounds (e.g., logistic regression), together with an O(d²) static-memory iterative procedure that maintains the Pareto property without revisiting past data.

Significance. If the derivations are correct and the memory-constrained iterates remain on (or sufficiently close to) the true Pareto front, the work supplies a principled, non-heuristic link between continual learning and multi-task Pareto optimality. The explicit O(d²) memory bound and the treatment of conflicting tasks without a common minimizer are concrete strengths that could inform future algorithm design.

major comments (2)
  1. [§3.2] §3.2 (linear-regression case): the update rule is derived from the Pareto-stationarity condition on the joint loss vector, yet the manuscript does not prove that the finite-memory summary statistics suffice to reach a point on the true (non-approximated) Pareto front when the tasks have no common minimizer; the sub-optimality gap with respect to the forgetting metric used in the experiments is therefore left unquantified.
  2. [§4] §4 (general quadratic-upper-bound losses): the local quadratic approximation or dual-variable update is stated to preserve Pareto minimality, but the argument appears to rely on the same summary statistics as the linear case; a concrete counter-example or bound showing when the approximation error causes the iterate to be dominated by a feasible point attainable only with full data would strengthen the central claim.
minor comments (2)
  1. Notation for the Pareto front and the forgetting measure should be introduced once and used consistently; several paragraphs switch between “Pareto-minimal forgetting” and “minimal forgetting in the Pareto sense” without explicit cross-reference.
  2. The experimental section would benefit from an ablation that isolates the effect of the memory footprint versus the choice of Pareto solver; current tables report aggregate accuracy but do not separate these factors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The observations help us strengthen the theoretical grounding of the PMF-CL framework. We respond to each major comment below and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (linear-regression case): the update rule is derived from the Pareto-stationarity condition on the joint loss vector, yet the manuscript does not prove that the finite-memory summary statistics suffice to reach a point on the true (non-approximated) Pareto front when the tasks have no common minimizer; the sub-optimality gap with respect to the forgetting metric used in the experiments is therefore left unquantified.

    Authors: For the linear-regression case the per-task losses are exactly quadratic, so the maintained summary statistics (Gram matrix of features and cross terms with targets) permit exact evaluation of the Pareto-stationarity condition without revisiting past data. We acknowledge that the current manuscript does not contain an explicit theorem proving that the resulting memory-efficient iterates lie on the true Pareto front rather than an approximation thereof. We will add such a theorem together with its proof in the revised manuscript. We will also report a quantitative bound (or empirical gap) on the forgetting metric relative to an oracle that retains all data. revision: yes

  2. Referee: [§4] §4 (general quadratic-upper-bound losses): the local quadratic approximation or dual-variable update is stated to preserve Pareto minimality, but the argument appears to rely on the same summary statistics as the linear case; a concrete counter-example or bound showing when the approximation error causes the iterate to be dominated by a feasible point attainable only with full data would strengthen the central claim.

    Authors: In §4 the quadratic upper bound is used to obtain an approximate Pareto update that re-uses the same O(d²) summary statistics. We agree that the manuscript would benefit from a more precise characterization of the approximation error. In the revision we will supply (i) an explicit error bound expressed in terms of the tightness of the quadratic upper bound and (ii) a concrete counter-example demonstrating a case in which the approximate iterate is dominated by a point reachable only with full data, together with sufficient conditions under which the error remains controlled. revision: yes

Circularity Check

1 steps flagged

Pareto optimality equated to minimal forgetting by definition without independent derivation

specific steps
  1. self definitional [Abstract]
    "The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense."

    The text explicitly equates the Pareto-optimal set with minimal forgetting via the phrase 'by definition,' rendering the claimed 'principled' minimal-forgetting property tautological with the multi-task Pareto framing rather than independently derived from the continual-learning update rules or memory constraints.

full rationale

The paper's foundational claim defines Pareto-optimal solutions as those that 'by definition, minimally forget the previous tasks in the Pareto sense.' This self-definitional step makes the central result a restatement of the chosen multi-objective framing rather than a derived property of the sequential updates. The O(d²) memory iterative update for quadratic losses is presented as maintaining the Pareto property, yet relies on local approximations that do not provably recover the true joint Pareto front when tasks lack a common minimizer. No external benchmark or non-circular proof is supplied to separate the definition from the forgetting metric, yielding moderate circularity confined to the core premise.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that tasks lack a common global minimizer and on the standard mathematical notion of Pareto optimality; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Tasks rarely have a common global minimizer
    Explicitly stated as the practical regime where forgetting is inevitable.
  • standard math Pareto-optimal solutions exist for the multi-task objectives
    Invoked to define minimal forgetting in the Pareto sense.

pith-pipeline@v0.9.0 · 5766 in / 1256 out tokens · 49719 ms · 2026-05-20T11:49:42.971003+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 5 internal anchors

  1. [1]

    N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V

    Bell, J., Quarantiello, L., Coleman, E. N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V . The future of continual learning in the era of foundation models: Three key directions.arXiv preprint arXiv:2506.03320,

  2. [2]

    Unifying regularisation methods for continual learning.arXiv preprint arXiv:2006.06357,

    Benzing, F. Unifying regularisation methods for continual learning.arXiv preprint arXiv:2006.06357,

  3. [3]

    On Tiny Episodic Memories in Continual Learning

    Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with a-GEM. International Conference on Learning Representations, 2019a. URL https://openreview. net/forum?id=Hkf2_sC5FX. Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H., and Ranzato, M. On tiny episodic memories in continual learnin...

  4. [4]

    doi: https://doi.org/10.1016/j.inffus.2022.07.024

    ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2022.07.024. URL https://www. sciencedirect.com/science/article/pii/S1566253522000884. Désidéri, J.-A. Multiple-gradient descent algorithm (mgda) for multiobjective optimization.Comptes Rendus Mathematique, 350(5-6):313–318,

  5. [5]

    From continual learning to sgd and back: Better rates for continual linear models.arXiv preprint arXiv:2504.04579,

    Evron, I., Levinstein, R., Schliserman, M., Sherman, U., Koren, T., Soudry, D., and Srebro, N. From continual learning to sgd and back: Better rates for continual linear models.arXiv preprint arXiv:2504.04579,

  6. [6]

    Federated learning meets multi-objective optimization

    Hu, Z., Shaloudegi, K., Zhang, G., and Yu, Y . Federated learning meets multi-objective optimization. IEEE Transactions on Network Science and Engineering, 9(4):2039–2051,

  7. [7]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

  8. [8]

    K., Shah, N., and Prasad, R

    Keshri, S. K., Shah, N., and Prasad, R. On the convergence of continual federated learning using incrementally aggregated gradients.arXiv preprint arXiv:2411.07959,

  9. [9]

    doi: 10.1073/pnas.1611835114

    doi: 10.1073/pnas.1611835114. URL https: //www.pnas.org/doi/abs/10.1073/pnas.1611835114. Knoblauch, J., Husain, H., and Diethe, T. Optimal continual learning has perfect memory and is np-hard. InInternational Conference on Machine Learning, pp. 5327–5337. PMLR,

  10. [10]

    URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ

    ISBN 9780943494029. URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ. Levinstein, R., Attia, A., Schliserman, M., Sherman, U., Koren, T., Soudry, D., and Evron, I. Optimal rates in continual linear regression via increasing regularization.arXiv preprint arXiv:2506.06501,

  11. [11]

    Liu, H., Long, M., Wang, J., and Jordan, M. I. Towards understanding the transferability of deep representations.arXiv preprint arXiv:1909.12031,

  12. [12]

    Communication-Efficient Learning of Deep Networks from Decentralized Data,

    11 McMahan, H. B., Moore, E., Ramage, D., and y Arcas, B. A. Federated learning of deep networks using model averaging.CoRR, abs/1602.05629,

  13. [13]

    doi: 10.1007/978-3-540-88908-3_1

    ISBN 978-3-540-88908-3. doi: 10.1007/978-3-540-88908-3_1. URL https://doi.org/10.1007/ 978-3-540-88908-3_1. Motulsky, H. and Christopoulos, A.Fitting Models to Biological Data Using Linear and Non- linear Regression: A Practical Guide to Curve Fitting. Oxford University Press,

  14. [14]

    Learning Transferable Visual Models From Natural Language Supervision

    URLhttps://arxiv.org/abs/2103.00020. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents,

  15. [15]

    URLhttps://arxiv.org/abs/2204.06125. Ray, P. P. Timely need for navigating the potential and downsides of llms in healthcare and biomedicine.Briefings in bioinformatics, 25(3):bbae214,

  16. [16]

    doi: https://doi.org/10.1016/0921-8890(95)00004-Y

    ISSN 0921-8890. doi: https://doi.org/10.1016/0921-8890(95)00004-Y. URL https: //www.sciencedirect.com/science/article/pii/092188909500004Y. The Biology and Technology of Intelligent Autonomous Agents. 12 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A....

  17. [17]

    LLaMA: Open and Efficient Foundation Language Models

    URLhttps://arxiv.org/abs/2302.13971. Usmanova, A., Portet, F., Lalanda, P., and Vega, G. A distillation-based approach integrating continual learning and federated learning for pervasive services.arXiv preprint arXiv:2109.04197,

  18. [18]

    Fed- erated continual learning for edge-ai: A comprehensive survey,

    Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application.IEEE transactions on pattern analysis and machine intelligence, 46(8): 5362–5383, 2024a. Wang, Q., Liu, B., and Li, Y . Traceable federated continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni...

  19. [19]

    It is deeply rooted in cognitive science, as the ability to consolidate new information without overwriting existing knowledge is a hallmark of human intelligence (French, 1999)

    13 A Extended Literature Review Foundations of Continual Learning.The field of CL, or lifelong learning, emerged from early research in robotic agents and task rehearsal (Thrun & Mitchell, 1995; Thrun, 1995; Silver & Mercer, 2002). It is deeply rooted in cognitive science, as the ability to consolidate new information without overwriting existing knowledg...

  20. [20]

    Thus, the MSI for linear regression is Ilin t := (V e t ,Σ e t ,ey), with memorysize(I lin t ) =r t(d+ 2)

    B.1.2 MSI for Linear Regression (Lemma 4.1) Lemma B.2(MSI for Linear Regression).Using the eSVD of X t, we can write the loss function for linear regression as: Lt(θ) = 1 2nt V e t ⊤θ− eyt 2 Σe t + min θ′∈W Lt(θ′)(20) where we use the notation ∥z∥2 A :=∥Az∥ 2 2,ey=Σ e t −1U e t yt. Thus, the MSI for linear regression is Ilin t := (V e t ,Σ e t ,ey), with ...

  21. [21]

    Lemma E.1(Böhning (1992)).For a probability vector p∈∆ K−1, the matrix V(p) := Diag(p)− pp⊤ is upper bounded by: V:= 1 2 I K − 1 K 11⊤ ,(63) where1is the all-ones vector of sizeK

    that will be useful to derive the Hessian upper bound for multi-class logistic regression. Lemma E.1(Böhning (1992)).For a probability vector p∈∆ K−1, the matrix V(p) := Diag(p)− pp⊤ is upper bounded by: V:= 1 2 I K − 1 K 11⊤ ,(63) where1is the all-ones vector of sizeK. In other words V−V(p)is positive semi-definite. Substituting the boundV(p)⪯ Vinto (62)...

  22. [22]

    7:UpdateN k,i =N k−1,i +n k,i

    4:fork= 1toTdo 5:foreach clienti= 1tomin parallel do 6:Computeθ min k,i andH k,i based onD k,i. 7:UpdateN k,i =N k−1,i +n k,i. 8:UpdateB k,i = Nk−1,i Nk,i Bk−1,i +α (k) k,i H t,i. 9:Computeγ k,i =α (k) k,i H k,i θmin k,i −θ ⋆ k−1 . 10:SendN k,i,B k,i andγ k,i to the central server. 11:end for 12:At server,computepreferencesβ (k) i for alli∈[m]based on{N k...