Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition
Pith reviewed 2026-05-19 02:11 UTC · model grok-4.3
The pith
Encoding each fine-tuned model as an orthogonal delta from a shared base allows merging them without interference while supporting continual addition and reversible removal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MDM-OC encodes each task-specific model as a delta from a shared base model, projects each delta into an orthogonal subspace to remove conflicts, and merges the projected deltas through gradient-based optimization to produce a single model that retains accuracy on all tasks while supporting continual addition of new models and structured unmerging.
What carries the argument
Orthogonal projection of task deltas into separate subspaces, which ensures no parameter-space overlap between tasks during merging.
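The delta-and-project step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each model is flattened into a single parameter vector, uses sequential Gram-Schmidt (the scheme quoted later on this page) so that each delta occupies a one-dimensional direction, and replaces the paper's gradient-based merge with a plain sum. The names `orthogonalize_deltas`, `base`, and `finetuned` are hypothetical.

```python
import numpy as np

def orthogonalize_deltas(deltas, eps=1e-12):
    """Sequential (modified) Gram-Schmidt over flattened task deltas.

    Each delta has its components along all earlier projected deltas removed,
    so the returned vectors are mutually orthogonal. Norms are left unscaled.
    """
    basis = []      # orthonormal directions already claimed by earlier tasks
    projected = []
    for d in deltas:
        r = np.asarray(d, dtype=np.float64).copy()
        for q in basis:
            r -= np.dot(q, r) * q          # remove overlap with an earlier task
        projected.append(r)
        norm = np.linalg.norm(r)
        if norm > eps:                     # skip degenerate (fully overlapping) deltas
            basis.append(r / norm)
    return projected

# Toy stand-ins for a base model and three fine-tuned variants (assumption).
rng = np.random.default_rng(0)
base = rng.normal(size=64)
finetuned = [base + 0.1 * rng.normal(size=64) for _ in range(3)]

deltas = [w - base for w in finetuned]        # delta encoding
projected = orthogonalize_deltas(deltas)      # orthogonal projection
merged = base + np.sum(projected, axis=0)     # plain sum, standing in for the optimized merge

# Off-diagonal Gram entries measure the remaining overlap between tasks.
gram = np.array([[np.dot(a, b) for b in projected] for a in projected])
off_diag = gram - np.diag(np.diag(gram))
print("largest cross-task overlap:", np.max(np.abs(off_diag)))
```

The first delta passes through unchanged, so the result depends on task order. Whether the paper's method shares this property cannot be confirmed from this page.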
If this is right
- New models can be added continually without retraining or forgetting prior tasks.
- Specific models can be unmerged on demand to satisfy compliance requirements such as GDPR.
- The merged model maintains higher accuracy and backward transfer than prior merging baselines on vision and language benchmarks.
- Merging stays memory-efficient and computationally practical at scale.
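Continual addition and on-demand unmerging can be illustrated with a small registry that stores one orthogonalized delta per task. This is a sketch under stated assumptions: flattened parameter vectors, a plain additive merge instead of the paper's gradient-based optimization, and a hypothetical `DeltaRegistry` class that does not appear in the paper.

```python
import numpy as np

class DeltaRegistry:
    """Hypothetical store of orthogonalized task deltas over a shared base."""

    def __init__(self, base):
        self.base = np.asarray(base, dtype=np.float64)
        self.projected = {}   # task name -> stored orthogonalized delta

    def add(self, name, finetuned):
        """Encode a fine-tuned model as a delta and project out existing tasks."""
        r = np.asarray(finetuned, dtype=np.float64) - self.base
        for p in self.projected.values():
            denom = np.dot(p, p)
            if denom > 0.0:
                r = r - (np.dot(p, r) / denom) * p
        self.projected[name] = r

    def remove(self, name):
        """Unmerge a task by dropping its stored delta."""
        del self.projected[name]

    def merged(self):
        """Base parameters plus every delta currently registered."""
        out = self.base.copy()
        for p in self.projected.values():
            out += p
        return out

rng = np.random.default_rng(1)
base = rng.normal(size=32)
reg = DeltaRegistry(base)
reg.add("task_a", base + 0.1 * rng.normal(size=32))
before = reg.merged()

reg.add("task_b", base + 0.1 * rng.normal(size=32))   # continual addition
reg.remove("task_b")                                   # structured unmerge
after = reg.merged()
print("unmerge restores previous parameters:", np.allclose(before, after))
```

In this additive sketch, removing the most recently added task restores the earlier parameters exactly. Removing an earlier task leaves later deltas in the form they took when they were projected against it, so the result can differ from re-merging the remaining tasks from scratch.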
Where Pith is reading between the lines
- The same orthogonal-delta approach could extend to combining models trained on entirely different data modalities.
- Dynamic model libraries become feasible in which users add and drop capabilities without storing full copies of every variant.
- Integration with privacy techniques might allow selective removal of data influences without full retraining.
Load-bearing premise
Forcing task deltas into orthogonal subspaces removes interference between tasks without causing large losses in each task's specific performance.
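One way to probe this premise in parameter space is to measure how much of each delta's norm survives projection. The sketch below is an illustration with synthetic vectors, not an experiment from the paper: `retained_fractions` and `related_delta` are hypothetical helpers, the deltas are random vectors with a controllable shared component, and the projection is sequential Gram-Schmidt.

```python
import numpy as np

def retained_fractions(deltas):
    """Norm fraction each delta keeps after projecting out all earlier deltas."""
    basis, fractions = [], []
    for d in deltas:
        r = np.asarray(d, dtype=np.float64).copy()
        for q in basis:
            r -= np.dot(q, r) * q
        fractions.append(np.linalg.norm(r) / np.linalg.norm(d))
        n = np.linalg.norm(r)
        if n > 1e-12:
            basis.append(r / n)
    return fractions

rng = np.random.default_rng(0)
dim = 256
shared = rng.normal(size=dim)
shared /= np.linalg.norm(shared)

def related_delta(weight):
    """Synthetic delta mixing a shared direction with a private random one."""
    private = rng.normal(size=dim)
    private /= np.linalg.norm(private)
    return weight * shared + (1.0 - weight) * private

unrelated = retained_fractions([related_delta(0.0) for _ in range(4)])
related = retained_fractions([related_delta(0.8) for _ in range(4)])
print("unrelated tasks:", np.round(unrelated, 3))
print("related tasks:  ", np.round(related, 3))
```

Deltas dominated by a shared direction lose most of their norm once the first task has claimed that direction, while nearly independent deltas are almost untouched. Norm retained in parameter space is only a proxy; it does not by itself say how task accuracy changes.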
What would settle it
A merged model that shows clear drops in accuracy on any individual task compared with the separate fine-tuned version, or an unmerge step that fails to restore the original task-specific performance.
Original abstract
In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance such as GDPR requirements, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Modular Delta Merging with Orthogonal Constraints (MDM-OC), a framework for scalable, interference-free, and reversible composition of fine-tuned models. Each task model is encoded as a delta from a shared base, projected into mutually orthogonal subspaces to eliminate conflicts, and merged via gradient-based optimization. The approach incorporates elastic weight consolidation and synthetic replay for stability, supports continual addition of new models and structured unmerging for compliance (e.g., GDPR), and claims to outperform baselines on vision and NLP benchmarks in accuracy, backward transfer, and unmerge fidelity while remaining memory-efficient.
Significance. If the central claims hold, this would represent a meaningful contribution to continual learning and model merging by addressing task interference and enabling reversibility without requiring task data at merge time. The combination of orthogonal constraints with gradient merging and replay mechanisms could support practical modular AI systems where models must be composed, updated, and selectively undone.
major comments (2)
- The method description of projecting task deltas into orthogonal subspaces provides no explicit bound on retained norm, no analysis of projection error, and no ablation isolating the projection step. This is load-bearing for the interference-free claim, as shared directions between task vectors (common in related tasks) can discard substantial signal in the orthogonal complement, potentially requiring task data or surrogates for recovery during gradient optimization—contradicting the positioning that such data is unnecessary at merge time.
- The abstract asserts outperformance in accuracy, backward transfer, and unmerge fidelity but supplies no quantitative results, error bars, ablation studies, or derivation details. Specific tables or figures with these comparisons to baselines are required to support the central experimental claims.
minor comments (1)
- Clarify the notation for the projection operator and delta encoding early in the method section to improve readability for readers unfamiliar with the construction.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.
- Referee: The method description of projecting task deltas into orthogonal subspaces provides no explicit bound on retained norm, no analysis of projection error, and no ablation isolating the projection step. This is load-bearing for the interference-free claim, as shared directions between task vectors (common in related tasks) can discard substantial signal in the orthogonal complement, potentially requiring task data or surrogates for recovery during gradient optimization, contradicting the positioning that such data is unnecessary at merge time.
  Authors: We appreciate this insightful observation. To address the lack of explicit bounds and analysis, we have added a new subsection in the Methods (Section 3.2) providing a bound on the retained norm: the projected delta retains at least (1 - max cosine similarity) of the original norm in expectation, with a full derivation of the projection error. We also include an ablation study in the Experiments section (Table 4) isolating the orthogonal projection step, showing significant improvement in interference reduction when enabled. Regarding the potential requirement for task data: the gradient-based merging optimizes the merged model using only the orthogonalized deltas and the base model parameters, without access to task-specific data or surrogates. The orthogonality ensures minimal overlap, and any residual interference is mitigated by the EWC and replay mechanisms, which are applied during initial fine-tuning, not at merge time. We have clarified this distinction in the revised text to avoid any misinterpretation.
  revision: yes
- Referee: The abstract asserts outperformance in accuracy, backward transfer, and unmerge fidelity but supplies no quantitative results, error bars, ablation studies, or derivation details. Specific tables or figures with these comparisons to baselines are required to support the central experimental claims.
  Authors: We thank the referee for pointing this out. While the abstract is intended as a concise summary, the full quantitative results are detailed in the Experiments section. Specifically, Table 1 reports accuracy and backward transfer on vision benchmarks (CIFAR-100, ImageNet subsets) with error bars from 5 random seeds. Table 2 covers NLP tasks (GLUE subsets). Figure 3 shows unmerge fidelity metrics. Ablation studies are in Table 5, including variants without orthogonal constraints. All comparisons to baselines (Task Arithmetic, TIES-Merging, etc.) are provided with statistical significance tests. We believe these support the claims, but if the editor deems it necessary, we can add a summary table in the main text or highlight key numbers more prominently.
  revision: partial
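The retained-norm bound quoted in the first response can be checked by hand for the simplest case of two deltas. The derivation below is a sketch, not the paper's Section 3.2: the symbols $\Delta_1$, $\Delta_2$, $\tilde{\Delta}_2$, and $c$ are notation introduced here, and the projection is assumed to be a single Gram-Schmidt step.

```latex
% Assumed notation: \Delta_1, \Delta_2 are flattened task deltas,
% c is their cosine similarity, \tilde{\Delta}_2 is the projected second delta.
\begin{align*}
c &= \frac{\langle \Delta_1, \Delta_2 \rangle}{\|\Delta_1\|\,\|\Delta_2\|},
\qquad
\tilde{\Delta}_2 = \Delta_2 - \frac{\langle \Delta_1, \Delta_2 \rangle}{\|\Delta_1\|^2}\,\Delta_1, \\
\|\tilde{\Delta}_2\|^2
  &= \|\Delta_2\|^2 - \frac{\langle \Delta_1, \Delta_2 \rangle^2}{\|\Delta_1\|^2}
   = \|\Delta_2\|^2 \left(1 - c^2\right), \\
\frac{\|\tilde{\Delta}_2\|}{\|\Delta_2\|}
  &= \sqrt{1 - c^2}
   = \sqrt{(1 - |c|)(1 + |c|)}
   \;\ge\; 1 - |c| .
\end{align*}
```

For two deltas the quoted form therefore holds deterministically, with $|c|$ in place of the maximum cosine similarity. With three or more deltas, pairwise cosines alone do not control the residual: if $\Delta_1 \perp \Delta_2$ and $\Delta_3 = (\Delta_1 + \Delta_2)/\sqrt{2}$ with $\|\Delta_1\| = \|\Delta_2\|$, every pairwise cosine involving $\Delta_3$ is about $0.71$, yet $\Delta_3$ lies in the span of the earlier deltas and its projection is zero. Any general bound would need an additional assumption, which the page does not state.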
Circularity Check
No circularity: framework introduced as novel construction
The paper proposes MDM-OC as a new method that encodes fine-tuned models as deltas from a base, projects them into orthogonal subspaces, and merges via gradient optimization, with added stability via EWC and synthetic replay. No equations, parameters, or claims are shown to reduce by construction to fitted inputs, self-citations, or prior results from the same authors. The derivation is presented as an original framework rather than a statistical prediction or renamed known result, making the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Orthogonal projection of task deltas eliminates interference while preserving individual task performance
invented entities (1)
- MDM-OC framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Sequential Gram-Schmidt orthogonalization projects each delta onto the null space of previously processed deltas"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.