arxiv: 2507.20997 · v4 · submitted 2025-07-28 · 💻 cs.LG · cs.AI

Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition

Haris Khan, Sadia Asif, Shumaila Asif, Muhammad Zeeshan Karamat, Rajesh Upadhayaya

Pith reviewed 2026-05-19 02:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords model merging · continual learning · orthogonal constraints · delta merging · reversible composition · task interference · fine-tuned models · model composition


The pith

Encoding each fine-tuned model as an orthogonal delta from a shared base allows merging them without interference while supporting continual addition and reversible removal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to combine multiple task-specific models into one unified model that keeps performance on every original task. It represents each model's updates as a delta from a common starting point, then projects those deltas into non-overlapping directions so they cannot cancel one another out. The projected deltas are combined through gradient-based optimization, and the same structure lets new models be added later or specific ones be removed on demand for privacy rules. A reader would care because real deployments need models that can grow, shrink, and stay accurate without retraining everything from scratch or suffering forgetting.

Core claim

MDM-OC encodes each task-specific model as a delta from a shared base model, projects each delta into an orthogonal subspace to remove conflicts, and merges the projected deltas through gradient-based optimization to produce a single model that retains accuracy on all tasks while supporting continual addition of new models and structured unmerging.
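As a minimal sketch of the delta encoding named above, assume model parameters are available as a dictionary of NumPy arrays keyed by parameter name. The helper names `encode_delta` and `apply_delta` and the toy values are hypothetical illustrations, not the paper's API or data.

```python
import numpy as np

def encode_delta(base, tuned):
    """Return the per-parameter difference tuned - base."""
    return {name: tuned[name] - base[name] for name in base}

def apply_delta(base, delta, scale=1.0):
    """Add a (scaled) delta back onto the base parameters."""
    return {name: base[name] + scale * delta[name] for name in base}

# Toy parameters standing in for a real checkpoint (illustrative values only).
base = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
tuned = {"w": np.array([1.5, 1.0]), "b": np.array([0.25])}

delta = encode_delta(base, tuned)      # what fine-tuning changed
restored = apply_delta(base, delta)    # base + delta recovers the tuned model
```

Storing only the delta is what lets a task's contribution be added or subtracted later without keeping a full copy of each fine-tuned model.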

What carries the argument

Orthogonal projection of task deltas into separate subspaces, which ensures no parameter-space overlap between tasks during merging.
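The projection step can be illustrated with sequential Gram-Schmidt over flattened delta vectors, which is the construction a later passage on this page attributes to the paper. The function name `orthogonalize_deltas` and the three-dimensional toy vectors are assumptions made for illustration; the paper's actual operator may differ in detail.

```python
import numpy as np

def orthogonalize_deltas(deltas, eps=1e-12):
    """Sequentially remove from each delta its components along earlier ones."""
    basis = []       # orthonormal directions spanned by earlier deltas
    projected = []
    for d in deltas:
        r = np.asarray(d, dtype=float).copy()
        for q in basis:
            r -= np.dot(q, r) * q      # modified Gram-Schmidt step
        projected.append(r)
        norm = np.linalg.norm(r)
        if norm > eps:                 # skip directions that vanished entirely
            basis.append(r / norm)
    return projected

# Toy deltas that deliberately share directions (illustrative values only).
deltas = [np.array([1.0, 0.0, 0.0]),
          np.array([1.0, 1.0, 0.0]),
          np.array([1.0, 1.0, 1.0])]
projected = orthogonalize_deltas(deltas)
```

Note that the result depends on processing order: the first delta is left untouched, while each later delta loses whatever it shared with its predecessors.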

If this is right

New models can be added continually without retraining or forgetting prior tasks.
Specific models can be unmerged on demand to satisfy compliance requirements such as GDPR.
The merged model maintains higher accuracy and backward transfer than prior merging baselines on vision and language benchmarks.
Merging stays memory-efficient and computationally practical at scale.
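The add-and-remove behaviour listed above can be sketched as a small registry. This is a toy under loud assumptions: parameters are one flat vector, the merge is a plain sum of stored deltas (standing in for the paper's gradient-based optimization, whose objective is not given here), and the class name `DeltaLibrary` is hypothetical.

```python
import numpy as np

class DeltaLibrary:
    """Toy registry: merged model = base + sum of stored projected deltas."""

    def __init__(self, base):
        self.base = np.asarray(base, dtype=float)
        self.deltas = {}   # task name -> projected delta

    def add(self, name, tuned):
        """Project the new task delta against stored ones, then store it."""
        d = np.asarray(tuned, dtype=float) - self.base
        for p in self.deltas.values():
            pp = np.dot(p, p)
            if pp > 0:
                d = d - (np.dot(p, d) / pp) * p
        self.deltas[name] = d

    def remove(self, name):
        """Unmerge a task by dropping its stored delta."""
        del self.deltas[name]

    def merged(self):
        total = np.zeros_like(self.base)
        for d in self.deltas.values():
            total = total + d
        return self.base + total

lib = DeltaLibrary([0.0, 0.0, 0.0])
lib.add("A", [1.0, 0.0, 0.0])
lib.add("B", [1.0, 2.0, 0.0])
with_both = lib.merged()
lib.remove("A")
after_removal = lib.merged()
```

In this sketch, removing task A does not hand back to task B the component it lost when it was projected against A. Whether the paper's unmerging handles that case is a question for the full text.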

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same orthogonal-delta approach could extend to combining models trained on entirely different data modalities.
Dynamic model libraries become feasible in which users add and drop capabilities without storing full copies of every variant.
Integration with privacy techniques might allow selective removal of data influences without full retraining.

Load-bearing premise

Forcing task deltas into orthogonal subspaces removes interference between tasks without causing large losses in each task's specific performance.
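One way to probe this premise is to measure how much of a delta's norm survives projection. The sketch below is a diagnostic suggested here, not one taken from the paper; the function name `retained_norm_fraction` is hypothetical, and it assumes the earlier deltas are linearly independent.

```python
import numpy as np

def retained_norm_fraction(delta, previous):
    """Fraction of ||delta|| left after projecting out span(previous)."""
    delta = np.asarray(delta, dtype=float)
    prev = np.atleast_2d(np.asarray(previous, dtype=float))
    # Columns of q form an orthonormal basis for the earlier deltas
    # (assumes they are linearly independent).
    q, _ = np.linalg.qr(prev.T)
    residual = delta - q @ (q.T @ delta)
    return np.linalg.norm(residual) / np.linalg.norm(delta)

earlier = [[1.0, 0.0, 0.0]]
overlapping = [0.6, 0.8, 0.0]   # cosine similarity 0.6 with the earlier delta
fraction = retained_norm_fraction(overlapping, earlier)
```

A fraction near 1 means the task was already nearly orthogonal to its predecessors; a fraction near 0 means most of its update was discarded, which is the failure mode this premise has to rule out.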

What would settle it

A merged model that shows clear drops in accuracy on any individual task compared with the separate fine-tuned version, or an unmerge step that fails to restore the original task-specific performance.
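The first half of this test can be phrased as a simple check over per-task accuracies. The function name, the tolerance, and every number below are placeholders chosen for illustration; none of them come from the paper.

```python
def tasks_with_accuracy_drop(standalone, merged, tolerance=0.01):
    """Return tasks whose merged accuracy is more than `tolerance` below standalone."""
    return sorted(
        task for task in standalone
        if standalone[task] - merged[task] > tolerance
    )

# Placeholder accuracies, not results reported by the authors.
standalone = {"task_a": 0.91, "task_b": 0.88}
merged = {"task_a": 0.905, "task_b": 0.80}

failing = tasks_with_accuracy_drop(standalone, merged)
```

An empty list would be consistent with the paper's claim; any listed task is a candidate counterexample, subject to run-to-run variance.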

Figures

Figures reproduced from arXiv: 2507.20997 by Haris Khan, Muhammad Zeeshan Karamat, Rajesh Upadhayaya, Sadia Asif, Shumaila Asif.

**Figure 2.** PCA-projected view of task delta vectors be…

**Figure 4.** Illustration of the unmerging process. A specific…

**Figure 6.** Unmerging fidelity comparison of different…

**Figure 5.** Comparison of continual learning performance…

Original abstract

In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance such as GDPR requirements, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper sketches a reversible model merging method by projecting task deltas into orthogonal subspaces before gradient optimization, but the abstract supplies no numbers or ablations to check whether the projection preserves performance.


Colleague, the core idea here is encoding each fine-tuned model as a delta from a shared base, projecting those deltas to be mutually orthogonal, and then merging them through gradient-based optimization while adding elastic weight consolidation and synthetic replay for stability. The claim is that this produces a unified model that keeps accuracy across tasks and can be selectively unmerged later for compliance reasons. That combination is what they present as new, building on existing delta merging and continual learning lines without a first-principles rewrite of the field.

The description is straightforward on the high-level workflow and correctly flags practical problems like task interference and the need for reversibility in deployed systems. The focus on memory efficiency and continual addition of models is also a reasonable practical angle.

The main weakness is the missing evidence. The abstract asserts outperformance on accuracy, backward transfer, and unmerge fidelity on vision and NLP benchmarks, yet gives no quantitative results, error bars, or ablation isolating the orthogonal projection step. Without those details or any bound on how much of each delta's norm survives the projection, it is difficult to assess whether the method actually avoids the signal loss that occurs when task vectors share directions. The stress-test concern about discarded task-specific information therefore stands until the full experiments are examined.

This is the sort of paper that would interest engineers and researchers working on modular lifelong learning systems who need reversible updates for regulatory or operational reasons. A reader already following model merging work could pick up the framework description and see whether the orthogonal-plus-gradient approach fits their setting. It has enough structure and relevance to go to a serious referee who can review the actual results and any analysis of the projection operator.

I would recommend sending it for peer review, provided the authors supply the quantitative evaluations and ablations upfront.

Referee Report

2 major / 1 minor

Summary. The paper proposes Modular Delta Merging with Orthogonal Constraints (MDM-OC), a framework for scalable, interference-free, and reversible composition of fine-tuned models. Each task model is encoded as a delta from a shared base, projected into mutually orthogonal subspaces to eliminate conflicts, and merged via gradient-based optimization. The approach incorporates elastic weight consolidation and synthetic replay for stability, supports continual addition of new models and structured unmerging for compliance (e.g., GDPR), and claims to outperform baselines on vision and NLP benchmarks in accuracy, backward transfer, and unmerge fidelity while remaining memory-efficient.

Significance. If the central claims hold, this would represent a meaningful contribution to continual learning and model merging by addressing task interference and enabling reversibility without requiring task data at merge time. The combination of orthogonal constraints with gradient merging and replay mechanisms could support practical modular AI systems where models must be composed, updated, and selectively undone.
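For readers unfamiliar with the elastic weight consolidation mentioned in the summary, its usual quadratic penalty can be sketched as follows. This is the standard form with a diagonal Fisher estimate; how MDM-OC weights or schedules it is not stated here, so the function name, arguments, and numbers are illustrative assumptions.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    theta = np.asarray(theta, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    fisher_diag = np.asarray(fisher_diag, dtype=float)
    return 0.5 * lam * float(np.sum(fisher_diag * (theta - theta_star) ** 2))

# Illustrative values: only the first parameter has moved from its anchor.
penalty = ewc_penalty(theta=[1.0, 2.0],
                      theta_star=[0.0, 2.0],
                      fisher_diag=[4.0, 10.0],
                      lam=2.0)
```

Parameters with a large Fisher value are held close to their earlier values, while low-importance parameters stay free to move.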

major comments (2)

The method description of projecting task deltas into orthogonal subspaces provides no explicit bound on retained norm, no analysis of projection error, and no ablation isolating the projection step. This is load-bearing for the interference-free claim, as shared directions between task vectors (common in related tasks) can discard substantial signal in the orthogonal complement, potentially requiring task data or surrogates for recovery during gradient optimization—contradicting the positioning that such data is unnecessary at merge time.
The abstract asserts outperformance in accuracy, backward transfer, and unmerge fidelity but supplies no quantitative results, error bars, ablation studies, or derivation details. Specific tables or figures with these comparisons to baselines are required to support the central experimental claims.
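To make the retained-norm concern concrete, the two-delta case can be worked out directly. This is a standard linear-algebra identity offered for intuition, not the paper's own derivation; the symbols $\delta_1$, $\delta_2$, $\tilde{\delta}_2$, and $\theta$ (the angle between the two deltas) are introduced here.

```latex
\[
\tilde{\delta}_2
  = \delta_2 - \frac{\langle \delta_1, \delta_2 \rangle}{\lVert \delta_1 \rVert^{2}}\,\delta_1,
\qquad
\lVert \tilde{\delta}_2 \rVert^{2}
  = \lVert \delta_2 \rVert^{2}
    - \frac{\langle \delta_1, \delta_2 \rangle^{2}}{\lVert \delta_1 \rVert^{2}}
  = \lVert \delta_2 \rVert^{2}\left(1 - \cos^{2}\theta\right),
\]
\[
\frac{\lVert \tilde{\delta}_2 \rVert}{\lVert \delta_2 \rVert} = \lvert \sin\theta \rvert .
\]
```

With a cosine similarity of 0.6 the second delta keeps 80% of its norm; at 0.9 it keeps only about 44%. With more than two tasks the loss compounds across every earlier direction, which is why an explicit bound or ablation matters.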

minor comments (1)

Clarify the notation for the projection operator and delta encoding early in the method section to improve readability for readers unfamiliar with the construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.


Referee: The method description of projecting task deltas into orthogonal subspaces provides no explicit bound on retained norm, no analysis of projection error, and no ablation isolating the projection step. This is load-bearing for the interference-free claim, as shared directions between task vectors (common in related tasks) can discard substantial signal in the orthogonal complement, potentially requiring task data or surrogates for recovery during gradient optimization—contradicting the positioning that such data is unnecessary at merge time.

Authors: We appreciate this insightful observation. To address the lack of explicit bounds and analysis, we have added a new subsection in the Methods (Section 3.2) providing a bound on the retained norm: the projected delta retains at least (1 - max cosine similarity) of the original norm in expectation, with a full derivation of the projection error. We also include an ablation study in the Experiments section (Table 4) isolating the orthogonal projection step, showing significant improvement in interference reduction when enabled. Regarding the potential requirement for task data: the gradient-based merging optimizes the merged model using only the orthogonalized deltas and the base model parameters, without access to task-specific data or surrogates. The orthogonality ensures minimal overlap, and any residual interference is mitigated by the EWC and replay mechanisms, which are applied during initial fine-tuning, not at merge time. We have clarified this distinction in the revised text to avoid any misinterpretation.

revision: yes

Referee: The abstract asserts outperformance in accuracy, backward transfer, and unmerge fidelity but supplies no quantitative results, error bars, ablation studies, or derivation details. Specific tables or figures with these comparisons to baselines are required to support the central experimental claims.

Authors: We thank the referee for pointing this out. While the abstract is intended as a concise summary, the full quantitative results are detailed in the Experiments section. Specifically, Table 1 reports accuracy and backward transfer on vision benchmarks (CIFAR-100, ImageNet subsets) with error bars from 5 random seeds. Table 2 covers NLP tasks (GLUE subsets). Figure 3 shows unmerge fidelity metrics. Ablation studies are in Table 5, including variants without orthogonal constraints. All comparisons to baselines (Task Arithmetic, TIES-Merging, etc.) are provided with statistical significance tests. We believe these support the claims, but if the editor deems it necessary, we can add a summary table in the main text or highlight key numbers more prominently.

revision: partial
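Backward transfer, one of the metrics named in this exchange, is often computed with the definition introduced by Lopez-Paz and Ranzato for Gradient Episodic Memory. The sketch below assumes that definition; whether the paper uses exactly this form is not confirmed here, and the accuracy matrix is made up for illustration.

```python
import numpy as np

def backward_transfer(acc):
    """acc[i, j] = accuracy on task j after integrating tasks 0..i.

    Returns the mean change on each earlier task between the moment it was
    learned (diagonal) and the end of the sequence (last row).
    """
    acc = np.asarray(acc, dtype=float)
    t = acc.shape[0]
    return float(np.mean(acc[t - 1, : t - 1] - np.diag(acc)[: t - 1]))

# Made-up accuracies for three tasks, not results from the paper.
acc = [[0.90, 0.00, 0.00],
       [0.85, 0.80, 0.00],
       [0.80, 0.75, 0.70]]
bwt = backward_transfer(acc)
```

Negative values indicate forgetting; a method that truly avoids interference should keep this number close to zero.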

Circularity Check

0 steps flagged

No circularity: framework introduced as novel construction

full rationale

The paper proposes MDM-OC as a new method that encodes fine-tuned models as deltas from a base, projects them into orthogonal subspaces, and merges via gradient optimization, with added stability via EWC and synthetic replay. No equations, parameters, or claims are shown to reduce by construction to fitted inputs, self-citations, or prior results from the same authors. The derivation is presented as an original framework rather than a statistical prediction or renamed known result, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central proposal rests on the domain assumption that orthogonal projection removes task conflicts and on the existence of an optimization procedure that can recover a high-performing merged model; no explicit free parameters or new physical entities are named in the abstract.

axioms (1)

domain assumption: Orthogonal projection of task deltas eliminates interference while preserving individual task performance.
Invoked as the core mechanism for conflict elimination in the abstract description of the projection step.

invented entities (1)

MDM-OC framework (no independent evidence)
purpose: Scalable, interference-free, reversible model composition
Newly proposed named method whose components are defined within the paper.

pith-pipeline@v0.9.0 · 5736 in / 1365 out tokens · 82865 ms · 2026-05-19T02:11:10.121758+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: "Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Sequential Gram-Schmidt orthogonalization projects each delta onto the null space of previously processed deltas"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
