pith. machine review for the scientific record

arxiv: 2604.17078 · v1 · submitted 2026-04-18 · 💻 cs.AI

Recognition: unknown

Understanding and Enforcing Weight Disentanglement in Task Arithmetic

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords task arithmetic · weight disentanglement · task-feature specialization · orthogonal regularization · model editing · fine-tuning · task vectors · model merging

The pith

Task-feature specialization causes weight disentanglement in task arithmetic and can be strengthened by enforcing orthogonality on task vectors during fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that weight disentanglement (the ability to combine task vectors in a pre-trained model without their effects interfering) arises because the model allocates distinct internal features to each task. That same internal property yields orthogonal task vectors as a direct geometric consequence. Because the internal property cannot be enforced directly, the authors introduce OrthoReg, a regularization term applied to the weight updates during fine-tuning on each task, and prove that it increases orthogonality and thereby promotes disentanglement. This matters because task arithmetic offers a training-free way to edit and compose model capabilities, and clarifying its mechanism makes such edits more predictable and effective.
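
Task arithmetic itself is simple to state: a task vector is the difference between fine-tuned and pre-trained weights, and composition adds scaled task vectors back onto the pre-trained model. A minimal sketch with illustrative toy weights (the variable names are not the paper's):

```python
import numpy as np

# Pre-trained weights and two hypothetical fine-tuned checkpoints.
theta_0 = np.array([1.0, 2.0, 3.0, 4.0])
theta_task_a = np.array([1.5, 2.0, 3.0, 4.0])  # fine-tuned on task A
theta_task_b = np.array([1.0, 2.0, 3.0, 4.8])  # fine-tuned on task B

# A task vector is the weight delta produced by fine-tuning.
tau_a = theta_task_a - theta_0
tau_b = theta_task_b - theta_0

# Multi-task composition: add scaled task vectors to the base model.
lam = 1.0
theta_merged = theta_0 + lam * (tau_a + tau_b)
```

Weight disentanglement is the hope that `theta_merged` behaves like each single-task model on that task's own inputs.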

Core claim

We introduce Task-Feature Specialization (TFS) as the fundamental principle underlying weight disentanglement. We prove that TFS is a sufficient condition for weight disentanglement and that it also produces weight vector orthogonality. This link lets us promote disentanglement indirectly by shaping the observable geometry: OrthoReg regularizes the weight updates that form each task vector to be orthogonal, and we prove this regularization promotes disentanglement. Experiments show OrthoReg consistently improves the performance of multiple task arithmetic methods.
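
The disentanglement property the claim targets can be written out, following the formulation standard in the task-arithmetic literature (this paraphrases rather than quotes the paper; $\alpha_t$ denotes a scaling coefficient):

```latex
% Weight disentanglement: composing all task vectors does not change
% per-task behavior on each task's own data distribution D_t.
f\bigl(x;\ \theta_0 + \textstyle\sum_{t'} \alpha_{t'}\,\tau_{t'}\bigr)
  = f\bigl(x;\ \theta_0 + \alpha_t\,\tau_t\bigr)
  \qquad \text{for all } x \in \mathcal{D}_t,\ \text{for each task } t.
```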

What carries the argument

Task-Feature Specialization (TFS), defined as a model's allocation of distinct internal features to different tasks, which suffices for weight disentanglement and implies orthogonality of the resulting task vectors.
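
The intuition behind TFS can be seen in a deliberately idealized toy model (not the paper's construction) where two tasks write to disjoint coordinates: disjoint feature allocation forces the task vectors to be exactly orthogonal, and composing them leaves each task's parameters untouched.

```python
import numpy as np

theta_0 = np.zeros(6)

# Each task's fine-tuning only touches "its own" features:
# task A updates coordinates 0-2, task B updates coordinates 3-5.
tau_a = np.array([0.7, -0.2, 0.4, 0.0, 0.0, 0.0])
tau_b = np.array([0.0, 0.0, 0.0, -0.5, 0.3, 0.9])

# Disjoint support => exact orthogonality of the task vectors.
assert np.dot(tau_a, tau_b) == 0.0

# Composition: on task A's coordinates, the merged model is
# indistinguishable from the single-task model theta_0 + tau_a.
theta_merged = theta_0 + tau_a + tau_b
assert np.allclose(theta_merged[:3], (theta_0 + tau_a)[:3])
```

Real networks only approximate this picture, which is why the paper treats orthogonality as a consequence of TFS rather than a definition of it.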

If this is right

  • OrthoReg applied during fine-tuning increases orthogonality and thereby improves disentanglement.
  • Multiple existing task arithmetic techniques gain performance when their task vectors are produced under OrthoReg.
  • Weight vector orthogonality functions as a measurable proxy for the harder-to-observe TFS property.
  • The theoretical chain from TFS to disentanglement to orthogonality supplies a causal account for why task arithmetic succeeds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If orthogonality reliably signals successful disentanglement, measuring angles between task vectors could serve as a quick diagnostic before full composition tests.
  • The same geometric regularization idea might be adapted to other model-merging settings that rely on additive updates.
  • Models fine-tuned under OrthoReg could support larger numbers of composable tasks before interference appears.
  • Whether TFS arises spontaneously in certain pre-training regimes or architectures could be tested by checking orthogonality of natural task vectors.
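
The diagnostic suggested in the first bullet is cheap to run: flatten each task vector and inspect the pairwise cosine similarities, where off-diagonal values near zero indicate near-orthogonal (and, on the paper's account, well-disentangled) task vectors. A sketch with synthetic vectors standing in for real task vectors:

```python
import numpy as np

def pairwise_cosine(task_vectors):
    """Cosine similarity matrix between flattened task vectors."""
    V = np.stack([v.ravel() for v in task_vectors])
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ V.T

rng = np.random.default_rng(0)
# High-dimensional random vectors are nearly orthogonal, mimicking the
# behavior reported for independently fine-tuned task vectors.
taus = [rng.standard_normal(10_000) for _ in range(3)]
C = pairwise_cosine(taus)
# Diagonal entries are 1; off-diagonal entries sit close to zero.
```

On real checkpoints the inputs would be the per-layer weight deltas; this is the quantity Figures 9-11 visualize as heatmaps.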

Load-bearing premise

That actively making weight updates orthogonal during fine-tuning will induce or strengthen the internal property of task-feature specialization.
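
The paper's exact loss is not reproduced here. One common way to realize such a constraint, which the following sketch assumes, is a soft orthogonality penalty that pushes the Gram matrix of the update's columns toward a diagonal, added to the task loss during fine-tuning:

```python
import numpy as np

def ortho_penalty(delta_w):
    """Soft orthogonality penalty on a weight update matrix.

    Penalizes off-diagonal entries of delta_w^T delta_w, i.e. pairwise
    inner products between the update's columns; it is zero iff the
    columns are mutually orthogonal. (Illustrative only; the paper's
    OrthoReg term may be parameterized differently.)
    """
    gram = delta_w.T @ delta_w
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))

# An update with orthogonal columns incurs zero penalty...
dw_ortho = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
# ...while one with correlated columns is penalized.
dw_corr = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

# During fine-tuning this would enter the objective as
# total_loss = task_loss + reg_strength * ortho_penalty(delta_w).
```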

What would settle it

An experiment in which OrthoReg produces clearly orthogonal task vectors yet the composed model still shows interference and accuracy drops when tasks are combined.

Figures

Figures reproduced from arXiv: 2604.17078 by Dacheng Tao, Lei Wang, Qi Fan, Shangge Liu, Wenbin Li, Yang Gao, Yinghuan Shi, Yuehan Yin.

Figure 1: Conceptual illustration of our central thesis: Task …
Figure 2: Empirical evidence of weight vector orthogonality in a …
Figure 3: An overview of the OrthoReg method. It mitigates task …
Figure 4: The accuracy of merged models (ViT-L-14) across the …
Figure 6: Analysis of hyperparameter sensitivity on ViT-B-16. …
Figure 7: Angle distributions of weight matrix columns for all layers in ViT-B/16. Each subplot displays a histogram of the angles …
Figure 8: The accuracy of merged models across the eight benchmark tasks for different ViT architectures. Each subplot shows the …
Figure 9: Pairwise cosine similarity heatmaps of task vectors for ViT-B-16 across different methods. The top row shows the baseline …
Figure 10: Pairwise cosine similarity heatmaps of task vectors for ViT-B-32 across different methods. The top row shows the baseline …
Figure 11: Pairwise cosine similarity heatmaps of task vectors for ViT-L-14 across different methods. The top row shows the baseline …
Original abstract

Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of "weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model ($\theta_0$) or the task vectors ($\tau_t$) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on the weight updates ($\Delta W$) that constitute $\tau_t$ during fine-tuning. And we theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods. Code is available at https://github.com/RL-MIND/OrthoReg.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Task-Feature Specialization (TFS) as the fundamental principle underlying weight disentanglement in task arithmetic. It proves TFS is a sufficient condition for disentanglement and that TFS implies orthogonality of task vectors. It then proposes OrthoReg, a regularization method that enforces orthogonality on weight updates during fine-tuning, claims a theoretical proof that this promotes disentanglement, and reports consistent experimental improvements across task arithmetic methods, with code released.

Significance. If the proofs and the bidirectional link hold, the work supplies a theoretical foundation for task arithmetic by identifying a common cause (TFS) for both functional disentanglement and a measurable geometric property, while delivering a simple, reproducible regularization technique that improves existing methods. The explicit proofs for TFS sufficiency and the availability of code are clear strengths supporting further development in model editing.

major comments (2)
  1. Abstract and theoretical analysis of OrthoReg: the claim that OrthoReg promotes disentanglement rests on enforcing orthogonality in the updates that form task vectors, yet the provided reasoning only establishes the forward direction (TFS implies orthogonality and disentanglement). The manuscript must supply the explicit step or assumption (e.g., on feature-map linearity or loss convexity) that converts the geometric constraint back into a guarantee of TFS; without it, the regularization could produce orthogonal yet still entangled representations, undermining the central justification for the method.
  2. Proof of TFS sufficiency (theoretical section): while the sufficiency direction is stated, the manuscript should clarify whether the derivation assumes a specific model class (linear heads, etc.) or holds generally; any such assumption is load-bearing for the claim that TFS is the intrinsic cause of disentanglement across architectures.
minor comments (1)
  1. Notation for task vectors and updates (throughout): the distinction between pre-trained weights, task vectors, and fine-tuning updates could be made more explicit in the first use of each symbol to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful and constructive comments on our manuscript. We have carefully reviewed the major points concerning the theoretical analysis of OrthoReg and the assumptions underlying the TFS sufficiency proof. Below we respond point-by-point, indicating the revisions we will make to address the concerns while preserving the core contributions.

Point-by-point responses
  1. Referee: Abstract and theoretical analysis of OrthoReg: the claim that OrthoReg promotes disentanglement rests on enforcing orthogonality in the updates that form task vectors, yet the provided reasoning only establishes the forward direction (TFS implies orthogonality and disentanglement). The manuscript must supply the explicit step or assumption (e.g., on feature-map linearity or loss convexity) that converts the geometric constraint back into a guarantee of TFS; without it the regularization could produce orthogonal yet still entangled representations, undermining the central justification for the method.

    Authors: We agree that the manuscript establishes the forward direction (TFS implies orthogonality and disentanglement) but does not fully close the loop with a converse guarantee. OrthoReg is motivated by orthogonality being a necessary geometric consequence of TFS; enforcing it during fine-tuning serves as a practical proxy to encourage feature specialization. A strict bidirectional proof would indeed require additional assumptions, such as linearity of the feature maps or convexity of the loss. We have revised the abstract and theoretical section to explicitly acknowledge this limitation, clarify that OrthoReg promotes the observable geometric property tightly associated with TFS (rather than claiming a full guarantee), and add a brief discussion of the conditions needed for the converse. The empirical improvements in disentanglement metrics and task arithmetic performance across architectures remain as supporting evidence. revision: yes

  2. Referee: Proof of TFS sufficiency (theoretical section): while the sufficiency direction is stated, the manuscript should clarify whether the derivation assumes a specific model class (linear heads, etc.) or holds generally; any such assumption is load-bearing for the claim that TFS is the intrinsic cause of disentanglement across architectures.

    Authors: The TFS sufficiency proof is derived in a general setting for neural networks where task vectors are added to a pre-trained model and functional behavior depends on distinct feature allocation. It does not restrict to linear heads but assumes standard additive composition in weight space and that task performance is governed by the specialized internal representations. We have revised the theoretical section to explicitly enumerate these modeling assumptions and to state that the result holds for the transformer and CNN architectures evaluated in the paper, while noting that applicability to other model classes would require analogous verification. This clarification reinforces rather than weakens the positioning of TFS as a fundamental principle. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via new definitions and one-way proofs

full rationale

The paper introduces TFS as a novel internal property, proves it sufficient for weight disentanglement, derives orthogonality as a geometric consequence, and proposes OrthoReg to enforce orthogonality as a tractable proxy while claiming a separate theoretical proof that the regularization promotes disentanglement. These steps rely on explicit definitions and logical implications rather than redefining the target outcome in terms of itself or fitting parameters to the final metric. No self-citation chains, ansatz smuggling, or uniqueness theorems imported from prior author work are present in the provided text. The reverse direction (orthogonality implying TFS) is asserted via the new proof rather than assumed by construction, keeping the chain non-circular even if the proof's validity is debatable on other grounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the newly introduced TFS property and standard linear-algebra facts about orthogonality; no free parameters are mentioned.

axioms (1)
  • standard math · Vector orthogonality implies non-interference in linear weight updates
    Invoked to connect the geometric property to functional disentanglement.
invented entities (1)
  • Task-Feature Specialization (TFS) · no independent evidence
    purpose: Fundamental principle that causes weight disentanglement
    Newly defined concept introduced to explain the success of task arithmetic.

pith-pipeline@v0.9.0 · 5608 in / 1254 out tokens · 72486 ms · 2026-05-10T06:31:06.086136+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 11 canonical work pages · 3 internal anchors
