PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks
Pith reviewed 2026-05-20 11:49 UTC · model grok-4.3
The pith
Finding Pareto-optimal solutions among task objectives yields minimal forgetting in continual learning of conflicting tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that continual learning of conflicting tasks reduces to locating Pareto-optimal points in the multi-objective space whose coordinates are the individual task losses. Such points are minimal-forgetting by construction: no other parameter vector can lower every task loss simultaneously. For quadratic problems the authors give iterative updates that achieve this property with static memory cost O(d squared) for d-dimensional models; the same construction extends to any loss possessing a quadratic upper bound.
What carries the argument
Pareto optimality over the vector of task-specific losses, which selects parameter vectors that cannot be dominated on all tasks at once and thereby encodes the minimal-forgetting compromise.
If this is right
- Explicit iterative algorithms exist for linear and basis-function regression that achieve the Pareto-minimal-forgetting property.
- The same construction applies to any loss admitting a quadratic upper bound, including logistic regression.
- Memory footprint remains O(d squared) and independent of the number of tasks for all quadratic cases.
- The framework supplies a principled replacement for methods that assume a shared global optimum across tasks.
Where Pith is reading between the lines
- The same Pareto construction could be tested on non-convex models by replacing exact Pareto search with approximate multi-objective gradient methods.
- Choice of which Pareto point to pick along the front may affect long-horizon performance when many tasks arrive in sequence.
- The approach suggests examining whether selective storage of only the most conflicting task gradients can further reduce the memory cost while preserving the Pareto guarantee.
Load-bearing premise
That a Pareto-optimal point chosen with respect to the current set of task losses directly yields the smallest possible forgetting in the sequential continual-learning sense when the tasks have no common global minimizer.
What would settle it
An explicit counter-example in which a non-Pareto parameter vector achieves strictly lower loss on every previous task than the Pareto solution selected by the algorithm would falsify the minimal-forgetting claim.
Figures
read the original abstract
In the literature, many continual learning (CL) algorithms have been proposed to address the issue of catastrophic forgetting in ML models (i.e., learning new tasks leads to the loss of performance on previously learned tasks). Although all CL approaches use some form of memory to retain information about past tasks, a grounded understanding of what information needs to be stored to minimize catastrophic forgetting remains elusive. Recently, it has been recognized that under the strong assumption of the existence of a common global minimizer over all tasks, catastrophic forgetting can be completely avoided. However, in practice, tasks rarely have a common global minimizer, and a certain amount of forgetting is inevitable. In this paper, we propose a foundational framework for principled and systematic CL of conflicting tasks using a multi-task learning (MTL) perspective. The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense. We derive Pareto-minimal-forgetting CL algorithms for linear and basis-function regression, and general loss functions which have a quadratic upper bound, e.g., logistic regression. For quadratic problems, PMF-CL uses memory-efficient iterative updates with a static memory footage of $\mathcal{O}(d^2)$ for models with $d$ parameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PMF-CL, a continual learning framework that recasts catastrophic forgetting as a multi-task optimization problem and selects Pareto-optimal parameter vectors with respect to the vector of task losses. It claims to derive exact or approximate update rules for linear regression, basis-function regression, and losses admitting quadratic upper bounds (e.g., logistic regression), together with an O(d²) static-memory iterative procedure that maintains the Pareto property without revisiting past data.
Significance. If the derivations are correct and the memory-constrained iterates remain on (or sufficiently close to) the true Pareto front, the work supplies a principled, non-heuristic link between continual learning and multi-task Pareto optimality. The explicit O(d²) memory bound and the treatment of conflicting tasks without a common minimizer are concrete strengths that could inform future algorithm design.
major comments (2)
- [§3.2] §3.2 (linear-regression case): the update rule is derived from the Pareto-stationarity condition on the joint loss vector, yet the manuscript does not prove that the finite-memory summary statistics suffice to reach a point on the true (non-approximated) Pareto front when the tasks have no common minimizer; the sub-optimality gap with respect to the forgetting metric used in the experiments is therefore left unquantified.
- [§4] §4 (general quadratic-upper-bound losses): the local quadratic approximation or dual-variable update is stated to preserve Pareto minimality, but the argument appears to rely on the same summary statistics as the linear case; a concrete counter-example or bound showing when the approximation error causes the iterate to be dominated by a feasible point attainable only with full data would strengthen the central claim.
minor comments (2)
- Notation for the Pareto front and the forgetting measure should be introduced once and used consistently; several paragraphs switch between “Pareto-minimal forgetting” and “minimal forgetting in the Pareto sense” without explicit cross-reference.
- The experimental section would benefit from an ablation that isolates the effect of the memory footprint versus the choice of Pareto solver; current tables report aggregate accuracy but do not separate these factors.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. The observations help us strengthen the theoretical grounding of the PMF-CL framework. We respond to each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [§3.2] §3.2 (linear-regression case): the update rule is derived from the Pareto-stationarity condition on the joint loss vector, yet the manuscript does not prove that the finite-memory summary statistics suffice to reach a point on the true (non-approximated) Pareto front when the tasks have no common minimizer; the sub-optimality gap with respect to the forgetting metric used in the experiments is therefore left unquantified.
Authors: For the linear-regression case the per-task losses are exactly quadratic, so the maintained summary statistics (Gram matrix of features and cross terms with targets) permit exact evaluation of the Pareto-stationarity condition without revisiting past data. We acknowledge that the current manuscript does not contain an explicit theorem proving that the resulting memory-efficient iterates lie on the true Pareto front rather than an approximation thereof. We will add such a theorem together with its proof in the revised manuscript. We will also report a quantitative bound (or empirical gap) on the forgetting metric relative to an oracle that retains all data. revision: yes
-
Referee: [§4] §4 (general quadratic-upper-bound losses): the local quadratic approximation or dual-variable update is stated to preserve Pareto minimality, but the argument appears to rely on the same summary statistics as the linear case; a concrete counter-example or bound showing when the approximation error causes the iterate to be dominated by a feasible point attainable only with full data would strengthen the central claim.
Authors: In §4 the quadratic upper bound is used to obtain an approximate Pareto update that re-uses the same O(d²) summary statistics. We agree that the manuscript would benefit from a more precise characterization of the approximation error. In the revision we will supply (i) an explicit error bound expressed in terms of the tightness of the quadratic upper bound and (ii) a concrete counter-example demonstrating a case in which the approximate iterate is dominated by a point reachable only with full data, together with sufficient conditions under which the error remains controlled. revision: yes
Circularity Check
Pareto optimality equated to minimal forgetting by definition without independent derivation
specific steps
-
self definitional
[Abstract]
"The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense."
The text explicitly equates the Pareto-optimal set with minimal forgetting via the phrase 'by definition,' rendering the claimed 'principled' minimal-forgetting property tautological with the multi-task Pareto framing rather than independently derived from the continual-learning update rules or memory constraints.
full rationale
The paper's foundational claim defines Pareto-optimal solutions as those that 'by definition, minimally forget the previous tasks in the Pareto sense.' This self-definitional step makes the central result a restatement of the chosen multi-objective framing rather than a derived property of the sequential updates. The O(d²) memory iterative update for quadratic losses is presented as maintaining the Pareto property, yet relies on local approximations that do not provably recover the true joint Pareto front when tasks lack a common minimizer. No external benchmark or non-circular proof is supplied to separate the definition from the forgetting metric, yielding moderate circularity confined to the core premise.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Tasks rarely have a common global minimizer
- standard math Pareto-optimal solutions exist for the multi-task objectives
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a foundational framework for principled and systematic CL of conflicting tasks using a multi-task learning (MTL) perspective. The approach is based on finding Pareto-optimal solutions... For quadratic problems, PMF-CL uses memory-efficient iterative updates with a static memory footage of O(d²)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 5.2 (Iterative PMF-CL for QUB Loss)... A_k Δθ_k = α_k H_k (θ_min_k − θ⋆_{k−1})
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V
Bell, J., Quarantiello, L., Coleman, E. N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V . The future of continual learning in the era of foundation models: Three key directions.arXiv preprint arXiv:2506.03320,
-
[2]
Unifying regularisation methods for continual learning.arXiv preprint arXiv:2006.06357,
Benzing, F. Unifying regularisation methods for continual learning.arXiv preprint arXiv:2006.06357,
-
[3]
On Tiny Episodic Memories in Continual Learning
Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with a-GEM. International Conference on Learning Representations, 2019a. URL https://openreview. net/forum?id=Hkf2_sC5FX. Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H., and Ranzato, M. On tiny episodic memories in continual learnin...
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[4]
doi: https://doi.org/10.1016/j.inffus.2022.07.024
ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2022.07.024. URL https://www. sciencedirect.com/science/article/pii/S1566253522000884. Désidéri, J.-A. Multiple-gradient descent algorithm (mgda) for multiobjective optimization.Comptes Rendus Mathematique, 350(5-6):313–318,
-
[5]
Evron, I., Levinstein, R., Schliserman, M., Sherman, U., Koren, T., Soudry, D., and Srebro, N. From continual learning to sgd and back: Better rates for continual linear models.arXiv preprint arXiv:2504.04579,
-
[6]
Federated learning meets multi-objective optimization
Hu, Z., Shaloudegi, K., Zhang, G., and Yu, Y . Federated learning meets multi-objective optimization. IEEE Transactions on Network Science and Engineering, 9(4):2039–2051,
work page 2039
-
[7]
Scaling Laws for Neural Language Models
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[8]
Keshri, S. K., Shah, N., and Prasad, R. On the convergence of continual federated learning using incrementally aggregated gradients.arXiv preprint arXiv:2411.07959,
-
[9]
doi: 10.1073/pnas.1611835114. URL https: //www.pnas.org/doi/abs/10.1073/pnas.1611835114. Knoblauch, J., Husain, H., and Diethe, T. Optimal continual learning has perfect memory and is np-hard. InInternational Conference on Machine Learning, pp. 5327–5337. PMLR,
-
[10]
URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ
ISBN 9780943494029. URLhttps://books.google.com/books?id=ZiLYAAAAMAAJ. Levinstein, R., Attia, A., Schliserman, M., Sherman, U., Koren, T., Soudry, D., and Evron, I. Optimal rates in continual linear regression via increasing regularization.arXiv preprint arXiv:2506.06501,
- [11]
-
[12]
Communication-Efficient Learning of Deep Networks from Decentralized Data,
11 McMahan, H. B., Moore, E., Ramage, D., and y Arcas, B. A. Federated learning of deep networks using model averaging.CoRR, abs/1602.05629,
-
[13]
doi: 10.1007/978-3-540-88908-3_1
ISBN 978-3-540-88908-3. doi: 10.1007/978-3-540-88908-3_1. URL https://doi.org/10.1007/ 978-3-540-88908-3_1. Motulsky, H. and Christopoulos, A.Fitting Models to Biological Data Using Linear and Non- linear Regression: A Practical Guide to Curve Fitting. Oxford University Press,
-
[14]
Learning Transferable Visual Models From Natural Language Supervision
URLhttps://arxiv.org/abs/2103.00020. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
URLhttps://arxiv.org/abs/2204.06125. Ray, P. P. Timely need for navigating the potential and downsides of llms in healthcare and biomedicine.Briefings in bioinformatics, 25(3):bbae214,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
doi: https://doi.org/10.1016/0921-8890(95)00004-Y
ISSN 0921-8890. doi: https://doi.org/10.1016/0921-8890(95)00004-Y. URL https: //www.sciencedirect.com/science/article/pii/092188909500004Y. The Biology and Technology of Intelligent Autonomous Agents. 12 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A....
-
[17]
LLaMA: Open and Efficient Foundation Language Models
URLhttps://arxiv.org/abs/2302.13971. Usmanova, A., Portet, F., Lalanda, P., and Vega, G. A distillation-based approach integrating continual learning and federated learning for pervasive services.arXiv preprint arXiv:2109.04197,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Fed- erated continual learning for edge-ai: A comprehensive survey,
Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application.IEEE transactions on pattern analysis and machine intelligence, 46(8): 5362–5383, 2024a. Wang, Q., Liu, B., and Li, Y . Traceable federated continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni...
-
[19]
13 A Extended Literature Review Foundations of Continual Learning.The field of CL, or lifelong learning, emerged from early research in robotic agents and task rehearsal (Thrun & Mitchell, 1995; Thrun, 1995; Silver & Mercer, 2002). It is deeply rooted in cognitive science, as the ability to consolidate new information without overwriting existing knowledg...
work page 1995
-
[20]
B.1.2 MSI for Linear Regression (Lemma 4.1) Lemma B.2(MSI for Linear Regression).Using the eSVD of X t, we can write the loss function for linear regression as: Lt(θ) = 1 2nt V e t ⊤θ− eyt 2 Σe t + min θ′∈W Lt(θ′)(20) where we use the notation ∥z∥2 A :=∥Az∥ 2 2,ey=Σ e t −1U e t yt. Thus, the MSI for linear regression is Ilin t := (V e t ,Σ e t ,ey), with ...
work page 2001
-
[21]
that will be useful to derive the Hessian upper bound for multi-class logistic regression. Lemma E.1(Böhning (1992)).For a probability vector p∈∆ K−1, the matrix V(p) := Diag(p)− pp⊤ is upper bounded by: V:= 1 2 I K − 1 K 11⊤ ,(63) where1is the all-ones vector of sizeK. In other words V−V(p)is positive semi-definite. Substituting the boundV(p)⪯ Vinto (62)...
work page 1992
-
[22]
4:fork= 1toTdo 5:foreach clienti= 1tomin parallel do 6:Computeθ min k,i andH k,i based onD k,i. 7:UpdateN k,i =N k−1,i +n k,i. 8:UpdateB k,i = Nk−1,i Nk,i Bk−1,i +α (k) k,i H t,i. 9:Computeγ k,i =α (k) k,i H k,i θmin k,i −θ ⋆ k−1 . 10:SendN k,i,B k,i andγ k,i to the central server. 11:end for 12:At server,computepreferencesβ (k) i for alli∈[m]based on{N k...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.