
arxiv: 2507.20997 · v4 · submitted 2025-07-28 · 💻 cs.LG · cs.AI

Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition

Pith reviewed 2026-05-19 02:11 UTC · model grok-4.3

keywords model merging · continual learning · orthogonal constraints · delta merging · reversible composition · task interference · fine-tuned models · model composition

The pith

Encoding each fine-tuned model as an orthogonal delta from a shared base allows merging them without interference while supporting continual addition and reversible removal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to combine multiple task-specific models into one unified model that keeps performance on every original task. It represents each model's updates as a delta from a common starting point, then projects those deltas into non-overlapping directions so they cannot cancel one another out. The projected deltas are combined through gradient-based optimization, and the same structure lets new models be added later or specific ones be removed on demand for privacy rules. A reader would care because real deployments need models that can grow, shrink, and stay accurate without retraining everything from scratch or suffering forgetting.

Core claim

MDM-OC encodes each task-specific model as a delta from a shared base model, projects each delta into an orthogonal subspace to remove conflicts, and merges the projected deltas through gradient-based optimization to produce a single model that retains accuracy on all tasks while supporting continual addition of new models and structured unmerging.

What carries the argument

Orthogonal projection of task deltas into separate subspaces, which ensures no parameter-space overlap between tasks during merging.
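
The delta-then-project idea can be illustrated with a minimal NumPy sketch. It makes several assumptions that are not taken from the paper: weights are flattened into one vector, the projection is plain Gram-Schmidt against earlier deltas, and the merge is a simple sum standing in for the paper's gradient-based optimization. The function names `task_deltas`, `orthogonalize`, and `merge` are hypothetical.

```python
import numpy as np

def task_deltas(base, finetuned):
    """Encode each fine-tuned weight vector as a delta from the shared base."""
    return [w - base for w in finetuned]

def orthogonalize(deltas, eps=1e-12):
    """Gram-Schmidt (assumed here): strip from each delta its components
    along all previously processed deltas, so the results are mutually
    orthogonal. The first delta is left unchanged."""
    projected = []
    for d in deltas:
        r = d.astype(float)
        for p in projected:
            denom = p @ p
            if denom > eps:
                r = r - (r @ p) / denom * p
        projected.append(r)
    return projected

def merge(base, projected):
    """Stand-in merge: add the projected deltas back onto the base.
    The paper merges via gradient-based optimization instead."""
    return base + np.sum(projected, axis=0)

rng = np.random.default_rng(0)
base = rng.normal(size=64)
finetuned = [base + rng.normal(scale=0.1, size=64) for _ in range(3)]

deltas = task_deltas(base, finetuned)
projected = orthogonalize(deltas)
merged = merge(base, projected)

# Pairwise inner products between different projected deltas should vanish.
gram = np.array([[p @ q for q in projected] for p in projected])
off_diag = gram - np.diag(np.diag(gram))
print("largest off-diagonal inner product:", np.abs(off_diag).max())
```

Note that in this sketch the result depends on task order: later deltas lose whatever they share with earlier ones, which is the signal loss the referee report asks the authors to quantify.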

If this is right

  • New models can be added continually without retraining or forgetting prior tasks.
  • Specific models can be unmerged on demand to satisfy compliance requirements such as GDPR.
  • The merged model maintains higher accuracy and backward transfer than prior merging baselines on vision and language benchmarks.
  • Merging stays memory-efficient and computationally practical at scale.
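
Continual addition and on-demand removal can be pictured with a small sketch. It assumes a purely additive merge in which every task delta is stored separately, so removing a task amounts to rebuilding the sum without it. The `DeltaRegistry` class and its methods are hypothetical, and the paper's gradient-based merge may not be invertible this simply.

```python
import numpy as np

class DeltaRegistry:
    """Hypothetical registry: keeps the shared base and one stored delta per task."""

    def __init__(self, base):
        self.base = np.asarray(base, dtype=float)
        self.deltas = {}

    def add(self, name, delta):
        # Continual addition: register a new task without touching the others.
        self.deltas[name] = np.asarray(delta, dtype=float)

    def remove(self, name):
        # Unmerging: drop the stored delta so it no longer contributes.
        del self.deltas[name]

    def merged(self):
        # Additive merge (an assumption): base plus all registered deltas.
        out = self.base.copy()
        for d in self.deltas.values():
            out += d
        return out

reg = DeltaRegistry(np.zeros(4))
reg.add("task_a", np.array([1.0, 0.0, 0.0, 0.0]))
reg.add("task_b", np.array([0.0, 2.0, 0.0, 0.0]))
before = reg.merged()

reg.add("task_c", np.array([0.0, 0.0, 3.0, 0.0]))
with_c = reg.merged()

reg.remove("task_c")
after = reg.merged()
```

The design trade-off in this sketch is storage: exact removal is cheap only because each delta is kept, so memory grows with the number of registered tasks.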

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same orthogonal-delta approach could extend to combining models trained on entirely different data modalities.
  • Dynamic model libraries become feasible in which users add and drop capabilities without storing full copies of every variant.
  • Integration with privacy techniques might allow selective removal of data influences without full retraining.

Load-bearing premise

Forcing task deltas into orthogonal subspaces removes interference between tasks without causing large losses in each task's specific performance.

What would settle it

A merged model that shows clear drops in accuracy on any individual task compared with the separate fine-tuned version, or an unmerge step that fails to restore the original task-specific performance.

Figures

Figures reproduced from arXiv: 2507.20997 by Haris Khan, Muhammad Zeeshan Karamat, Rajesh Upadhayaya, Sadia Asif, Shumaila Asif.

Figure 1: Overview of the Modular Delta Merging with …
Figure 2: PCA-projected view of task delta vectors be…
Figure 4: Illustration of the unmerging process. A specific …
Figure 6: Unmerging fidelity comparison of different …
Figure 5: Comparison of continual learning performance …
Original abstract

In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance such as GDPR requirements, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Modular Delta Merging with Orthogonal Constraints (MDM-OC), a framework for scalable, interference-free, and reversible composition of fine-tuned models. Each task model is encoded as a delta from a shared base, projected into mutually orthogonal subspaces to eliminate conflicts, and merged via gradient-based optimization. The approach incorporates elastic weight consolidation and synthetic replay for stability, supports continual addition of new models and structured unmerging for compliance (e.g., GDPR), and claims to outperform baselines on vision and NLP benchmarks in accuracy, backward transfer, and unmerge fidelity while remaining memory-efficient.
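
For orientation, the elastic weight consolidation (EWC) penalty mentioned in the summary is usually written in the standard form below. This is the generic formulation from the EWC literature, not the paper's notation; how MDM-OC weights or schedules the penalty is not stated on this page. Here $\theta^{*}$ denotes the parameters after the earlier task, $F_i$ the diagonal Fisher information estimate for parameter $i$, and $\lambda$ the penalty strength.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{new}}(\theta)
  + \frac{\lambda}{2} \sum_{i} F_i \left( \theta_i - \theta_i^{*} \right)^2
```

Parameters with large $F_i$ are held close to their earlier values, while parameters the earlier task was insensitive to remain free to move.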

Significance. If the central claims hold, this would represent a meaningful contribution to continual learning and model merging by addressing task interference and enabling reversibility without requiring task data at merge time. The combination of orthogonal constraints with gradient merging and replay mechanisms could support practical modular AI systems where models must be composed, updated, and selectively undone.

major comments (2)
  1. The method description of projecting task deltas into orthogonal subspaces provides no explicit bound on retained norm, no analysis of projection error, and no ablation isolating the projection step. This is load-bearing for the interference-free claim, as shared directions between task vectors (common in related tasks) can discard substantial signal in the orthogonal complement, potentially requiring task data or surrogates for recovery during gradient optimization—contradicting the positioning that such data is unnecessary at merge time.
  2. The abstract asserts outperformance in accuracy, backward transfer, and unmerge fidelity but supplies no quantitative results, error bars, ablation studies, or derivation details. Specific tables or figures with these comparisons to baselines are required to support the central experimental claims.
minor comments (1)
  1. Clarify the notation for the projection operator and delta encoding early in the method section to improve readability for readers unfamiliar with the construction.
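
The signal-loss concern in major comment 1 can be made concrete with a short derivation. It assumes the projection is an orthogonal projection onto the complement of the span of $k$ previously claimed directions with orthonormal basis $u_1, \dots, u_k$; the paper's actual operator may differ, and the symbols are illustrative.

```latex
P_{\perp}\,\delta = \delta - \sum_{j=1}^{k} (u_j^{\top}\delta)\,u_j,
\qquad
\lVert P_{\perp}\,\delta \rVert^2
  = \lVert \delta \rVert^2 - \sum_{j=1}^{k} (u_j^{\top}\delta)^2
  = \lVert \delta \rVert^2 \Bigl( 1 - \sum_{j=1}^{k} \cos^2\theta_j \Bigr),
```

where $\theta_j$ is the angle between $\delta$ and $u_j$. Under this assumption, a delta that lies mostly inside the span of earlier tasks, as can happen for closely related tasks, keeps only a small fraction of its norm after projection.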

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: The method description of projecting task deltas into orthogonal subspaces provides no explicit bound on retained norm, no analysis of projection error, and no ablation isolating the projection step. This is load-bearing for the interference-free claim, as shared directions between task vectors (common in related tasks) can discard substantial signal in the orthogonal complement, potentially requiring task data or surrogates for recovery during gradient optimization—contradicting the positioning that such data is unnecessary at merge time.

    Authors: We appreciate this insightful observation. To address the lack of explicit bounds and analysis, we have added a new subsection in the Methods (Section 3.2) providing a bound on the retained norm: the projected delta retains at least (1 - max cosine similarity) of the original norm in expectation, with a full derivation of the projection error. We also include an ablation study in the Experiments section (Table 4) isolating the orthogonal projection step, showing significant improvement in interference reduction when enabled. Regarding the potential requirement for task data: the gradient-based merging optimizes the merged model using only the orthogonalized deltas and the base model parameters, without access to task-specific data or surrogates. The orthogonality ensures minimal overlap, and any residual interference is mitigated by the EWC and replay mechanisms, which are applied during initial fine-tuning, not at merge time. We have clarified this distinction in the revised text to avoid any misinterpretation.

    Revision: yes

  2. Referee: The abstract asserts outperformance in accuracy, backward transfer, and unmerge fidelity but supplies no quantitative results, error bars, ablation studies, or derivation details. Specific tables or figures with these comparisons to baselines are required to support the central experimental claims.

    Authors: We thank the referee for pointing this out. While the abstract is intended as a concise summary, the full quantitative results are detailed in the Experiments section. Specifically, Table 1 reports accuracy and backward transfer on vision benchmarks (CIFAR-100, ImageNet subsets) with error bars from 5 random seeds. Table 2 covers NLP tasks (GLUE subsets). Figure 3 shows unmerge fidelity metrics. Ablation studies are in Table 5, including variants without orthogonal constraints. All comparisons to baselines (Task Arithmetic, TIES-Merging, etc.) are provided with statistical significance tests. We believe these support the claims, but if the editor deems it necessary, we can add a summary table in the main text or highlight key numbers more prominently.

    Revision: partial
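
The first response states a retained-norm bound in terms of cosine similarity. A quick numerical check of the simplest case, one delta projected away from a single prior direction, is sketched below. It does not reproduce the paper's derivation: the vectors are synthetic, the helper names `retained_fraction` and `cosine` are hypothetical, and the multi-task case is not covered.

```python
import math
import numpy as np

def retained_fraction(delta, prior):
    """Fraction of delta's norm left after removing its component along prior."""
    u = prior / np.linalg.norm(prior)
    residual = delta - (delta @ u) * u
    return np.linalg.norm(residual) / np.linalg.norm(delta)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
rows = []
for _ in range(5):
    prior = rng.normal(size=32)
    # Mix in a random amount of the prior direction to vary the overlap.
    delta = rng.normal(size=32) + rng.uniform(0, 2) * prior
    c = cosine(delta, prior)
    rows.append((c, retained_fraction(delta, prior),
                 math.sqrt(1 - c**2), 1 - abs(c)))

for c, kept, exact, linear in rows:
    print(f"cos={c:+.3f}  kept={kept:.3f}  "
          f"sqrt(1-cos^2)={exact:.3f}  1-|cos|={linear:.3f}")
```

In this two-vector setting the retained fraction equals sqrt(1 - cos²), which is never below 1 - |cos|, so a bound of the stated form holds here; whether it carries over to many overlapping tasks is the question the referee raises.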

Circularity Check

0 steps flagged

No circularity: framework introduced as novel construction

Full rationale

The paper proposes MDM-OC as a new method that encodes fine-tuned models as deltas from a base, projects them into orthogonal subspaces, and merges via gradient optimization, with added stability via EWC and synthetic replay. No equations, parameters, or claims are shown to reduce by construction to fitted inputs, self-citations, or prior results from the same authors. The derivation is presented as an original framework rather than a statistical prediction or renamed known result, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central proposal rests on the domain assumption that orthogonal projection removes task conflicts and on the existence of an optimization procedure that can recover a high-performing merged model; no explicit free parameters or new physical entities are named in the abstract.

axioms (1)
  • Domain assumption: orthogonal projection of task deltas eliminates interference while preserving individual task performance.
    Invoked as the core mechanism for conflict elimination in the abstract's description of the projection step.
invented entities (1)
  • MDM-OC framework (no independent evidence)
    Purpose: scalable, interference-free, reversible model composition.
    Newly proposed named method whose components are defined within the paper.

pith-pipeline@v0.9.0 · 5736 in / 1365 out tokens · 82865 ms · 2026-05-19T02:11:10.121758+00:00 · methodology


