pith. machine review for the scientific record.

arxiv: 2605.08311 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.CV

Recognition: no theorem link

Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords continual learning · model merging · trajectory regularization · storage efficiency · gradient dynamics · task alignment · multi-task models

The pith

Model merging in continual learning can proceed without storing prior models or data by optimizing in an augmented trajectory subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing merging approaches in continual learning prioritize global alignment, which accumulates task-specific errors over sequential data and produces vanishing gradients that stall optimization when a new task begins. This leaves the combined model in a poor starting state for further training under strict storage limits. The paper introduces Trajectory Regularized Merging to recast the merge step itself as an optimization inside an augmented trajectory subspace. Three joint objectives—task alignment, prediction consistency, and gradient responsiveness—aim to hold prior knowledge stable while restoring the ability to learn anew. Experiments across standard benchmarks indicate improved final performance compared with prior merging baselines.

Core claim

The paper claims that reformulating the merging phase as an optimization process within an augmented trajectory subspace, driven by the simultaneous pursuit of task alignment, prediction consistency, and gradient responsiveness, lets a merged model retain historical stability and re-activate optimization dynamics at the start of each new task without any storage of previous models or data.

What carries the argument

Trajectory Regularized Merging (TRM), an optimization procedure performed inside an augmented trajectory subspace that jointly enforces three objectives to balance stability and responsiveness.
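
The abstract never specifies how the augmented trajectory subspace is constructed. As a point of reference only, here is a minimal PyTorch sketch of what "merging as optimization inside a low-rank trajectory subspace" could look like, assuming the subspace is spanned per layer by the task vector between the two checkpoints that exist at merge time; the function name, the per-layer scalar coordinates, and the basis choice are illustrative assumptions, not the paper's definitions.

    import torch

    def merge_in_trajectory_subspace(theta_prev, theta_new, eval_loss,
                                     steps=100, lr=1e-2):
        """Hypothetical sketch: optimize merge coordinates v inside a
        subspace spanned by the current task vector (theta_new - theta_prev),
        using only models that exist at merge time, with no stored history.

        theta_prev, theta_new: lists of same-shaped tensors (two checkpoints)
        eval_loss: callable mapping a candidate parameter list to a scalar
                   loss (e.g., the combined TRM objectives on current data)
        """
        # One basis direction per layer: that layer's task vector. An
        # "augmented" subspace would add more directions per layer; the
        # paper does not specify them, so only merge-time quantities appear.
        basis = [(b - a).detach() for a, b in zip(theta_prev, theta_new)]
        anchors = [a.detach() for a in theta_prev]
        # Subspace coordinates: one scalar per layer (an illustrative choice).
        v = torch.full((len(basis),), 0.5, requires_grad=True)
        opt = torch.optim.Adam([v], lr=lr)
        for _ in range(steps):
            candidate = [a + vi * d for a, d, vi in zip(anchors, basis, v)]
            loss = eval_loss(candidate)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return [a + vi.detach() * d for a, d, vi in zip(anchors, basis, v)]

Initializing v at 0.5 makes plain weight averaging the starting point; optimization then adjusts each layer's coefficient separately instead of committing to one global interpolation, the pattern the paper's analysis faults.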

If this is right

  • Merged models retain task-specific performance without progressive degradation across a long sequence of tasks.
  • Gradient signals remain usable at the beginning of each new task, allowing continued training to proceed efficiently.
  • The same merged model can serve as the starting point for arbitrary new tasks without replay buffers or saved checkpoints.
  • Multi-task unification becomes feasible in memory-constrained continual-learning settings that previously required separate model copies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The subspace-augmentation idea could be grafted onto other merging algorithms to reduce their reliance on replay data.
  • If the three objectives prove separable, practitioners might drop one or two of them in low-resource deployments while keeping most of the benefit.
  • Similar trajectory-based regularization might help in federated or distributed settings where old model versions cannot be retained.

Load-bearing premise

That error accumulation from global alignment and vanishing gradients at task onset are the main obstacles, and that the three objectives can be jointly optimized in the augmented subspace to correct both without storing any prior models or data.

What would settle it

On a standard continual-learning benchmark sequence, if the TRM-merged model still shows rapid error growth from early tasks or requires many more steps than storage-based methods to reach low loss on a new task, the central claim would be refuted.
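
That test is mechanical enough to script. Below is a minimal sketch, assuming a hypothetical train_step closure that performs one optimizer update on the new task and returns the batch loss; running it from a TRM-merged initialization and from a storage-based baseline initialization operationalizes the "many more steps" criterion.

    def steps_to_threshold(train_step, threshold, max_steps=5000):
        """Count optimizer steps until the new-task loss first drops below
        `threshold`. A merged initialization that needs far more steps than
        a storage-based baseline would undercut the central claim."""
        for step in range(1, max_steps + 1):
            loss = train_step()  # one forward/backward/update; returns a float
            if loss < threshold:
                return step
        return None  # threshold never reached within the step budget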

Figures

Figures reproduced from arXiv: 2605.08311 by Cheng Deng, Xi Wang.

Figure 1. Loss landscape. The loss landscape is visualized along the trajectory from θ̃_0 to θ̃_1, where θ̃_1 is obtained by finetuning θ̃_0 on T_1; each merged model θ^ζ_{0,1}, with ζ ∈ {TIES, Model Stock, MagMax}, is projected onto this surface by its loss value, illustrating suboptimal local convergence.

Figure 2. Output drift between θ̃_1 and θ^ζ_{0,1}. Differences between the two models' layer-wise outputs for the same input; the activation deviation at each layer is quantified as the expected L2 distance, illustrating disruption of structural semantic representation.

Figure 3. Gradient variations between the initial point and neighboring points along the training trajectory, where L denotes the loss function and δ a perturbation on the trajectory.

Figure 4. Large-scale experiments on ImageNet-1K, with the 1000 classes evenly divided into 100 non-overlapping training tasks.

Figure 5. Ratio of replaced parameters, on ImageNet-R with 10 tasks. For each ratio, 10 independent training runs were conducted and the maximum, minimum, and average performance recorded; the positions of the replaced parameters and their source (θ_{t−1} or θ̃_t) are fully random, with only the overall replacement ratio fixed.

Figure 6. Sensitivity analysis. The horizontal axis varies λ_1; different line styles indicate different values of λ_2.

Figure 7. Sensitivity analysis of the searching epoch.
Original abstract

Model merging provides a compelling paradigm for integrating specialized expertise into a unified multi-task model, a goal that aligns naturally with the sequential knowledge acquisition in continual learning (CL). However, the requirement for preserving diverse forms of previous knowledge conflicts with the storage limitations inherent to CL. In this paper, we systematically analyze existing model merging methods under the constraints of CL. We find that current methods prioritize global alignment, which often leads to the accumulation and amplification of task-specific errors within the continuous data stream; and the vanishing gradients at the onset of subsequent tasks frequently cause optimization to stagnate. These leave the merged model in a suboptimal state at the beginning of the next training phase. To address these challenges, we propose Trajectory Regularized Merging (TRM), a framework that reformulates the merging phase as an optimization process within an augmented trajectory subspace. Our framework integrates three synergistic objectives, including task alignment, prediction consistency, and gradient responsiveness, to concurrently preserve the merged model's historical stability and re-activate optimization dynamics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper analyzes model merging methods under continual learning (CL) constraints, identifying that global alignment causes error accumulation and amplification in sequential data streams, while vanishing gradients at new task onsets lead to optimization stagnation and suboptimal merged models. It proposes Trajectory Regularized Merging (TRM), which reformulates merging as an optimization process inside an augmented trajectory subspace. TRM integrates three objectives—task alignment, prediction consistency, and gradient responsiveness—to preserve historical stability while reactivating dynamics, all without storing prior models or data. Experiments claim state-of-the-art results across multiple CL benchmarks.

Significance. If TRM's no-storage guarantee holds and the three objectives can be jointly optimized from current-task information alone, the work would meaningfully advance CL by enabling practical model merging without the memory overhead that currently limits its adoption. The focus on restoring early-task optimization dynamics addresses a practical bottleneck in sequential training pipelines.

major comments (3)
  1. §3 (TRM framework description): The construction of the 'augmented trajectory subspace' and the explicit loss formulations for prediction consistency and gradient responsiveness must be shown to depend only on the current task's data and model parameters. Prediction consistency normally requires comparing against prior outputs or logits, and gradient responsiveness typically needs historical gradient statistics; if either references stored trajectories or samples, the central no-storage claim fails even if optimization converges.
  2. §3.2–3.3 (objective definitions): The paper must provide the precise mathematical definitions (e.g., the forms of the consistency and responsiveness terms) and prove or demonstrate that they are computable without any prior-model access. Without these equations, it is impossible to verify that the three objectives are synergistic yet storage-independent.
  3. §4 (experiments and ablations): The reported SOTA results should be accompanied by controlled ablations that isolate each objective and explicitly confirm that no hidden storage of previous models, gradients, or samples occurs during merging or evaluation. Current high-level claims leave open the possibility that gains arise from mechanisms that violate the storage-avoidance premise.
minor comments (3)
  1. Abstract: The high-level description of the three objectives would be clearer if it briefly indicated the information each objective uses (current-task only vs. historical).
  2. §3 (notation): Ensure consistent use of symbols for the augmented subspace and the three loss terms across the method and experiment sections to avoid reader confusion.
  3. §2 (related work): A short paragraph contrasting TRM with prior merging methods that also claim reduced storage (e.g., those using parameter-efficient adapters) would strengthen positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and valuable comments that help clarify the storage-independence claims of TRM. We address each major point below and will incorporate the requested details and ablations in the revised manuscript.

Point-by-point responses
  1. Referee: §3 (TRM framework description): The construction of the 'augmented trajectory subspace' and the explicit loss formulations for prediction consistency and gradient responsiveness must be shown to depend only on the current task's data and model parameters. Prediction consistency normally requires comparing against prior outputs or logits, and gradient responsiveness typically needs historical gradient statistics; if either references stored trajectories or samples, the central no-storage claim fails even if optimization converges.

    Authors: The augmented trajectory subspace is constructed exclusively from the current model's parameters and the current task's data via a low-rank augmentation of the parameter space. The prediction consistency objective is implemented as a self-consistency regularizer that penalizes divergence between predictions on the original current-task samples and their augmented versions generated on-the-fly; no prior logits or models are involved. The gradient responsiveness term is the expected norm of the gradient of the merging loss with respect to the subspace coordinates, again computed solely from current-task forward and backward passes. We will expand §3 with explicit pseudocode and equations to demonstrate this dependence. revision: yes

  2. Referee: §3.2–3.3 (objective definitions): The paper must provide the precise mathematical definitions (e.g., the forms of the consistency and responsiveness terms) and prove or demonstrate that they are computable without any prior-model access. Without these equations, it is impossible to verify that the three objectives are synergistic yet storage-independent.

    Authors: We agree that the current presentation is insufficiently precise. In the revision we will insert the exact loss expressions: task alignment is the standard cross-entropy on current data; prediction consistency is the KL divergence between f_θ(x) and f_θ(x+δ) for current x and on-the-fly perturbations δ; gradient responsiveness is ||∇_v L(θ + v)|| where v lies in the current-task-derived subspace. A short paragraph will argue that each term requires only the current model and current mini-batches, with no external storage. revision: yes

  3. Referee: §4 (experiments and ablations): The reported SOTA results should be accompanied by controlled ablations that isolate each objective and explicitly confirm that no hidden storage of previous models, gradients, or samples occurs during merging or evaluation. Current high-level claims leave open the possibility that gains arise from mechanisms that violate the storage-avoidance premise.

    Authors: We will add a dedicated ablation subsection in §4 that reports performance when each of the three objectives is removed in turn, together with a memory-footprint table showing that peak memory during merging equals that of a single forward-backward pass on the current task. We will also state explicitly in the experimental protocol that no prior models, gradients, or replay buffers are retained or accessed at any stage. revision: yes
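
Responses 1 and 2 above pin down concrete forms for the three objectives: cross-entropy task alignment, a KL self-consistency term between f_θ(x) and f_θ(x+δ), and a responsiveness term on the gradient of the merging loss with respect to the subspace coordinates. Since this rebuttal is itself simulated, the following PyTorch sketch is doubly hypothetical; the loss weights, the Gaussian perturbation standing in for the unspecified augmentation, and the sign of the responsiveness term are all guesses, not the paper's equations.

    import torch
    import torch.nn.functional as F

    def trm_objectives(model, x, y, v, merge_loss, noise_std=0.05,
                       lam1=1.0, lam2=1.0):
        """Sketch of the three terms as stated in the (simulated) rebuttal.
        In a full implementation the model's weights would themselves be the
        subspace point theta + v; here that coupling is left to `merge_loss`.

        v: subspace coordinates with requires_grad=True
        merge_loss: callable mapping v to the scalar merging loss L(theta + v)
        """
        logits = model(x)
        # 1) Task alignment: plain cross-entropy on current-task data only.
        align = F.cross_entropy(logits, y)
        # 2) Prediction consistency: KL between predictions on x and on a
        #    perturbed copy generated on the fly (no stored prior logits).
        logits_pert = model(x + noise_std * torch.randn_like(x))
        consist = F.kl_div(F.log_softmax(logits_pert, dim=-1),
                           F.softmax(logits, dim=-1), reduction="batchmean")
        # 3) Gradient responsiveness: ||grad_v L(theta + v)||; subtracting it
        #    pushes the merged point to keep usable (non-vanishing) gradients.
        (g,) = torch.autograd.grad(merge_loss(v), v, create_graph=True)
        return align + lam1 * consist - lam2 * g.norm()

The memory-footprint table promised in response 3 could be produced with torch.cuda.reset_peak_memory_stats() before the merge step and torch.cuda.max_memory_allocated() after it, verifying that the peak matches a single forward-backward pass on the current task.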

Circularity Check

0 steps flagged

No derivation chain or equations present; proposal is algorithmic design without deductive reduction

Full rationale

The provided abstract and description contain no equations, derivations, or first-principles steps. The paper introduces TRM as a conceptual framework reformulating merging via an augmented subspace and three objectives, supported by experimental claims. No load-bearing mathematical argument reduces to its inputs by construction, self-definition, or fitted renaming. The no-storage premise is a design claim verified (or not) empirically rather than deduced circularly. This is the common case for method-proposal papers lacking formal proofs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents identification of concrete free parameters, axioms, or invented entities. The proposal implicitly relies on the existence and utility of an 'augmented trajectory subspace' and the joint effectiveness of the three objectives, but no details are supplied.

pith-pipeline@v0.9.0 · 5485 in / 1246 out tokens · 72736 ms · 2026-05-12T01:17:56.268486+00:00 · methodology

