pith. machine review for the scientific record.

arxiv: 2605.08311 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.CV

Recognition: no theorem link

Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords continual learning · model merging · trajectory regularization · storage efficiency · gradient dynamics · task alignment · multi-task models

The pith

Model merging in continual learning can proceed without storing prior models or data by optimizing in an augmented trajectory subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing merging approaches in continual learning prioritize global alignment, which accumulates task-specific errors over sequential data and produces vanishing gradients that stall optimization when a new task begins. This leaves the combined model in a poor starting state for further training under strict storage limits. The paper introduces Trajectory Regularized Merging to recast the merge step itself as an optimization inside an augmented trajectory subspace. Three joint objectives—task alignment, prediction consistency, and gradient responsiveness—aim to hold prior knowledge stable while restoring the ability to learn anew. Experiments across standard benchmarks indicate improved final performance compared with prior merging baselines.

Core claim

The paper claims that reformulating the merging phase as an optimization process within an augmented trajectory subspace, driven by the simultaneous pursuit of task alignment, prediction consistency, and gradient responsiveness, lets a merged model retain historical stability and re-activate optimization dynamics at the start of each new task without any storage of previous models or data.

What carries the argument

Trajectory Regularized Merging (TRM), an optimization procedure performed inside an augmented trajectory subspace that jointly enforces three objectives to balance stability and responsiveness.
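
The abstract never specifies how the augmented trajectory subspace is constructed. As a point of reference only, here is a minimal PyTorch sketch of what "merging as optimization inside a low-rank trajectory subspace" could look like, assuming the subspace is spanned per layer by the task vector between the two checkpoints that exist at merge time; the function name, the per-layer scalar coordinates, and the basis choice are illustrative assumptions, not the paper's definitions.

    import torch

    def merge_in_trajectory_subspace(theta_prev, theta_new, eval_loss,
                                     steps=100, lr=1e-2):
        """Hypothetical sketch: optimize merge coordinates v inside a
        subspace spanned by the current task vector (theta_new - theta_prev),
        using only models that exist at merge time, with no stored history.

        theta_prev, theta_new: lists of same-shaped tensors (two checkpoints)
        eval_loss: callable mapping a candidate parameter list to a scalar
                   loss (e.g., the combined TRM objectives on current data)
        """
        # One basis direction per layer: that layer's task vector. An
        # "augmented" subspace would add more directions per layer; the
        # paper does not specify them, so only merge-time quantities appear.
        basis = [(b - a).detach() for a, b in zip(theta_prev, theta_new)]
        anchors = [a.detach() for a in theta_prev]
        # Subspace coordinates: one scalar per layer (an illustrative choice).
        v = torch.full((len(basis),), 0.5, requires_grad=True)
        opt = torch.optim.Adam([v], lr=lr)
        for _ in range(steps):
            candidate = [a + vi * d for a, d, vi in zip(anchors, basis, v)]
            loss = eval_loss(candidate)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return [a + vi.detach() * d for a, d, vi in zip(anchors, basis, v)]

Initializing v at 0.5 makes plain weight averaging the starting point; optimization then adjusts each layer's coefficient separately instead of committing to one global interpolation, the pattern the paper's analysis faults.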

If this is right

  • Merged models retain task-specific performance without progressive degradation across a long sequence of tasks.
  • Gradient signals remain usable at the beginning of each new task, allowing continued training to proceed efficiently.
  • The same merged model can serve as the starting point for arbitrary new tasks without replay buffers or saved checkpoints.
  • Multi-task unification becomes feasible in memory-constrained continual-learning settings that previously required separate model copies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The subspace-augmentation idea could be grafted onto other merging algorithms to reduce their reliance on replay data.
  • If the three objectives prove separable, practitioners might drop one or two of them in low-resource deployments while keeping most of the benefit.
  • Similar trajectory-based regularization might help in federated or distributed settings where old model versions cannot be retained.

Load-bearing premise

That error accumulation from global alignment and vanishing gradients at task onset are the main obstacles, and that the three objectives can be jointly optimized in the augmented subspace to correct both without storing any prior models or data.

What would settle it

On a standard continual-learning benchmark sequence, if the TRM-merged model still shows rapid error growth from early tasks or requires many more steps than storage-based methods to reach low loss on a new task, the central claim would be refuted.
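
That test is mechanical enough to script. Below is a minimal sketch, assuming a hypothetical train_step closure that performs one optimizer update on the new task and returns the batch loss; running it from a TRM-merged initialization and from a storage-based baseline initialization operationalizes the "many more steps" criterion.

    def steps_to_threshold(train_step, threshold, max_steps=5000):
        """Count optimizer steps until the new-task loss first drops below
        `threshold`. A merged initialization that needs far more steps than
        a storage-based baseline would undercut the central claim."""
        for step in range(1, max_steps + 1):
            loss = train_step()  # one forward/backward/update; returns a float
            if loss < threshold:
                return step
        return None  # threshold never reached within the step budget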

Figures

Figures reproduced from arXiv: 2605.08311 by Cheng Deng, Xi Wang.

Figure 1. Loss landscape. The loss landscape is visualized along the trajectory from θ̃_0 to θ̃_1, where θ̃_1 is obtained by finetuning θ̃_0 on T_1; each merged model θ^ζ_{0,1}, with ζ ∈ {TIES, Model Stock, MagMax}, is projected onto this surface by its loss value, illustrating suboptimal local convergence.

Figure 2. Output drift between θ̃_1 and θ^ζ_{0,1}. Differences between the two models' layer-wise outputs for the same input; the activation deviation at each layer is quantified as the expected L2 distance, illustrating disruption of structural semantic representation.

Figure 3. Gradient variations between the initial point and neighboring points along the training trajectory, where L denotes the loss function and δ a perturbation on the trajectory.

Figure 4. Large-scale experiments on ImageNet-1K, with the 1000 classes evenly divided into 100 non-overlapping training tasks.

Figure 5. Ratio of replaced parameters, on ImageNet-R with 10 tasks. For each ratio, 10 independent training runs were conducted and the maximum, minimum, and average performance recorded; the positions of the replaced parameters and their source (θ_{t−1} or θ̃_t) are fully random, with only the overall replacement ratio fixed.

Figure 6. Sensitivity analysis. The horizontal axis varies λ_1; different line styles indicate different values of λ_2.

Figure 7. Sensitivity analysis of the searching epoch.
Original abstract

Model merging provides a compelling paradigm for integrating specialized expertise into a unified multi-task model, a goal that aligns naturally with the sequential knowledge acquisition in continual learning (CL). However, the requirement for preserving diverse forms of previous knowledge conflicts with the storage limitations inherent to CL. In this paper, we systematically analyze existing model merging methods under the constraints of CL. We find that current methods prioritize global alignment, which often leads to the accumulation and amplification of task-specific errors within the continuous data stream; and the vanishing gradients at the onset of subsequent tasks frequently cause optimization to stagnate. These leave the merged model in a suboptimal state at the beginning of the next training phase. To address these challenges, we propose Trajectory Regularized Merging (TRM), a framework that reformulates the merging phase as an optimization process within an augmented trajectory subspace. Our framework integrates three synergistic objectives, including task alignment, prediction consistency, and gradient responsiveness, to concurrently preserve the merged model's historical stability and re-activate optimization dynamics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper analyzes model merging methods under continual learning (CL) constraints, identifying that global alignment causes error accumulation and amplification in sequential data streams, while vanishing gradients at new task onsets lead to optimization stagnation and suboptimal merged models. It proposes Trajectory Regularized Merging (TRM), which reformulates merging as an optimization process inside an augmented trajectory subspace. TRM integrates three objectives—task alignment, prediction consistency, and gradient responsiveness—to preserve historical stability while reactivating dynamics, all without storing prior models or data. Experiments claim state-of-the-art results across multiple CL benchmarks.

Significance. If TRM's no-storage guarantee holds and the three objectives can be jointly optimized from current-task information alone, the work would meaningfully advance CL by enabling practical model merging without the memory overhead that currently limits its adoption. The focus on restoring early-task optimization dynamics addresses a practical bottleneck in sequential training pipelines.

major comments (3)
  1. §3 (TRM framework description): The construction of the 'augmented trajectory subspace' and the explicit loss formulations for prediction consistency and gradient responsiveness must be shown to depend only on the current task's data and model parameters. Prediction consistency normally requires comparing against prior outputs or logits, and gradient responsiveness typically needs historical gradient statistics; if either references stored trajectories or samples, the central no-storage claim fails even if optimization converges.
  2. §3.2–3.3 (objective definitions): The paper must provide the precise mathematical definitions (e.g., the forms of the consistency and responsiveness terms) and prove or demonstrate that they are computable without any prior-model access. Without these equations, it is impossible to verify that the three objectives are synergistic yet storage-independent.
  3. §4 (experiments and ablations): The reported SOTA results should be accompanied by controlled ablations that isolate each objective and explicitly confirm that no hidden storage of previous models, gradients, or samples occurs during merging or evaluation. Current high-level claims leave open the possibility that gains arise from mechanisms that violate the storage-avoidance premise.
minor comments (3)
  1. Abstract: The high-level description of the three objectives would be clearer if it briefly indicated the information each objective uses (current-task only vs. historical).
  2. §3 (notation): Ensure consistent use of symbols for the augmented subspace and the three loss terms across the method and experiment sections to avoid reader confusion.
  3. §2 (related work): A short paragraph contrasting TRM with prior merging methods that also claim reduced storage (e.g., those using parameter-efficient adapters) would strengthen positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and valuable comments that help clarify the storage-independence claims of TRM. We address each major point below and will incorporate the requested details and ablations in the revised manuscript.

Point-by-point responses
  1. Referee: §3 (TRM framework description): The construction of the 'augmented trajectory subspace' and the explicit loss formulations for prediction consistency and gradient responsiveness must be shown to depend only on the current task's data and model parameters. Prediction consistency normally requires comparing against prior outputs or logits, and gradient responsiveness typically needs historical gradient statistics; if either references stored trajectories or samples, the central no-storage claim fails even if optimization converges.

    Authors: The augmented trajectory subspace is constructed exclusively from the current model's parameters and the current task's data via a low-rank augmentation of the parameter space. The prediction consistency objective is implemented as a self-consistency regularizer that penalizes divergence between predictions on the original current-task samples and their augmented versions generated on-the-fly; no prior logits or models are involved. The gradient responsiveness term is the expected norm of the gradient of the merging loss with respect to the subspace coordinates, again computed solely from current-task forward and backward passes. We will expand §3 with explicit pseudocode and equations to demonstrate this dependence. revision: yes

  2. Referee: §3.2–3.3 (objective definitions): The paper must provide the precise mathematical definitions (e.g., the forms of the consistency and responsiveness terms) and prove or demonstrate that they are computable without any prior-model access. Without these equations, it is impossible to verify that the three objectives are synergistic yet storage-independent.

    Authors: We agree that the current presentation is insufficiently precise. In the revision we will insert the exact loss expressions: task alignment is the standard cross-entropy on current data; prediction consistency is the KL divergence between f_θ(x) and f_θ(x+δ) for current x and on-the-fly perturbations δ; gradient responsiveness is ||∇_v L(θ + v)|| where v lies in the current-task-derived subspace. A short paragraph will argue that each term requires only the current model and current mini-batches, with no external storage. revision: yes

  3. Referee: §4 (experiments and ablations): The reported SOTA results should be accompanied by controlled ablations that isolate each objective and explicitly confirm that no hidden storage of previous models, gradients, or samples occurs during merging or evaluation. Current high-level claims leave open the possibility that gains arise from mechanisms that violate the storage-avoidance premise.

    Authors: We will add a dedicated ablation subsection in §4 that reports performance when each of the three objectives is removed in turn, together with a memory-footprint table showing that peak memory during merging equals that of a single forward-backward pass on the current task. We will also state explicitly in the experimental protocol that no prior models, gradients, or replay buffers are retained or accessed at any stage. revision: yes
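
Responses 1 and 2 above pin down concrete forms for the three objectives: cross-entropy task alignment, a KL self-consistency term between f_θ(x) and f_θ(x+δ), and a responsiveness term on the gradient of the merging loss with respect to the subspace coordinates. Since this rebuttal is itself simulated, the following PyTorch sketch is doubly hypothetical; the loss weights, the Gaussian perturbation standing in for the unspecified augmentation, and the sign of the responsiveness term are all guesses, not the paper's equations.

    import torch
    import torch.nn.functional as F

    def trm_objectives(model, x, y, v, merge_loss, noise_std=0.05,
                       lam1=1.0, lam2=1.0):
        """Sketch of the three terms as stated in the (simulated) rebuttal.
        In a full implementation the model's weights would themselves be the
        subspace point theta + v; here that coupling is left to `merge_loss`.

        v: subspace coordinates with requires_grad=True
        merge_loss: callable mapping v to the scalar merging loss L(theta + v)
        """
        logits = model(x)
        # 1) Task alignment: plain cross-entropy on current-task data only.
        align = F.cross_entropy(logits, y)
        # 2) Prediction consistency: KL between predictions on x and on a
        #    perturbed copy generated on the fly (no stored prior logits).
        logits_pert = model(x + noise_std * torch.randn_like(x))
        consist = F.kl_div(F.log_softmax(logits_pert, dim=-1),
                           F.softmax(logits, dim=-1), reduction="batchmean")
        # 3) Gradient responsiveness: ||grad_v L(theta + v)||; subtracting it
        #    pushes the merged point to keep usable (non-vanishing) gradients.
        (g,) = torch.autograd.grad(merge_loss(v), v, create_graph=True)
        return align + lam1 * consist - lam2 * g.norm()

The memory-footprint table promised in response 3 could be produced with torch.cuda.reset_peak_memory_stats() before the merge step and torch.cuda.max_memory_allocated() after it, verifying that the peak matches a single forward-backward pass on the current task.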

Circularity Check

0 steps flagged

No derivation chain or equations present; proposal is algorithmic design without deductive reduction

Full rationale

The provided abstract and description contain no equations, derivations, or first-principles steps. The paper introduces TRM as a conceptual framework reformulating merging via an augmented subspace and three objectives, supported by experimental claims. No load-bearing mathematical argument reduces to its inputs by construction, self-definition, or fitted renaming. The no-storage premise is a design claim verified (or not) empirically rather than deduced circularly. This is the common case for method-proposal papers lacking formal proofs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents identification of concrete free parameters, axioms, or invented entities. The proposal implicitly relies on the existence and utility of an 'augmented trajectory subspace' and the joint effectiveness of the three objectives, but no details are supplied.

pith-pipeline@v0.9.0 · 5485 in / 1246 out tokens · 72736 ms · 2026-05-12T01:17:56.268486+00:00 · methodology

