Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature

Angelo Porrello; Felix Dangel; Lorenzo Bonicelli; Pietro Buzzega; Riccardo Salami; Simone Calderara; Thomas Sommariva

arxiv: 2602.17385 · v3 · pith:POMZ7PMXnew · submitted 2026-02-19 · 💻 cs.AI


Pith reviewed 2026-05-22 11:03 UTC · model grok-4.3

classification 💻 cs.AI

keywords task arithmetic · representation drift · K-FAC · curvature approximation · dataless regularization · model merging · foundation models · weight disentanglement


The pith

Task vectors can be disentangled without any task data by using a Kronecker-factored curvature approximation as a regularizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Task arithmetic adapts foundation models by combining task-specific vectors, but mixing them produces interference that shifts representations and hurts results. Standard fixes for this drift require holding out task data, which conflicts with privacy and modularity goals. The paper reframes the regularization goal as a curvature-matrix approximation problem so that established second-order methods can be applied directly. Adopting Kronecker-Factored Approximate Curvature yields a data-free regularizer that reaches state-of-the-art accuracy on task addition and negation while keeping cost constant as the task count grows and removing any need for held-out tuning.
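For readers new to task arithmetic, the two operations the paper evaluates (addition and negation) can be sketched on toy weights. This is a minimal illustration, not the authors' code: the function names, the flattened-list representation of weights, and the numbers are all assumptions made for the example.

```python
def task_vector(theta_ft, theta_0):
    """tau = theta_ft - theta_0, element-wise over flattened weights."""
    return [w_ft - w_0 for w_ft, w_0 in zip(theta_ft, theta_0)]


def apply_task_vectors(theta_0, taus, alpha=1.0):
    """Return theta_0 + alpha * sum_t tau_t.

    alpha > 0 with several vectors is task addition;
    alpha < 0 with one vector is task negation.
    """
    merged = list(theta_0)
    for tau in taus:
        merged = [w + alpha * d for w, d in zip(merged, tau)]
    return merged


# Hypothetical toy weights (three parameters).
theta_0 = [0.0, 1.0, 2.0]   # pre-trained model
theta_a = [0.5, 1.0, 2.0]   # fine-tuned on task A
theta_b = [0.0, 1.0, 1.0]   # fine-tuned on task B

tau_a = task_vector(theta_a, theta_0)
tau_b = task_vector(theta_b, theta_0)

added = apply_task_vectors(theta_0, [tau_a, tau_b], alpha=1.0)
negated = apply_task_vectors(theta_0, [tau_a], alpha=-1.0)
```

In this toy case the two task vectors touch disjoint coordinates, so adding them cannot interfere. Real task vectors overlap, which is the interference the regularizer is meant to reduce.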

Core claim

By casting representation-drift regularization as a curvature-matrix approximation task, Kronecker-Factored Approximate Curvature supplies a practical, dataless regularizer for task vectors that preserves performance under addition and negation.

What carries the argument

Kronecker-Factored Approximate Curvature applied as a regularizer that penalizes deviation from the original curvature structure when task vectors are combined.
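A sketch of why a Kronecker-factored block makes such a penalty cheap to evaluate. It assumes the per-layer penalty is a quadratic form in the layer's task vector, vec(τ)ᵀ (A ⊗ G) vec(τ), with A an input-side factor and G an output-side factor. The names `A_in`, `G_out`, the toy values, and the exact form of the penalty are assumptions for illustration, and the paper's loss may differ. The identity (A ⊗ G) vec(τ) = vec(G τ Aᵀ), with column-major vec, means the large Kronecker matrix never has to be built.

```python
def matmul(X, Y):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def kron(A, G):
    """Kronecker product A (x) G as a dense matrix."""
    return [[a * g for a in row_a for g in row_g]
            for row_a in A for row_g in G]


def vec(T):
    """Column-major vectorization: stack the columns of T."""
    return [T[k][j] for j in range(len(T[0])) for k in range(len(T))]


def penalty_dense(tau, A_in, G_out):
    """vec(tau)^T (A (x) G) vec(tau), building the full Kronecker matrix."""
    K = kron(A_in, G_out)
    v = vec(tau)
    Kv = [sum(K[r][c] * v[c] for c in range(len(v))) for r in range(len(v))]
    return sum(vr * kr for vr, kr in zip(v, Kv))


def penalty_factored(tau, A_in, G_out):
    """Same value via <tau, G tau A^T>, using only the two small factors."""
    A_t = [list(col) for col in zip(*A_in)]
    M = matmul(matmul(G_out, tau), A_t)
    return sum(tau[k][j] * M[k][j]
               for k in range(len(tau)) for j in range(len(tau[0])))


# Hypothetical layer with 2 inputs and 3 outputs; tau has shape (out, in).
A_in = [[2, 1], [1, 3]]                   # input-side factor (2x2)
G_out = [[2, 0, 1], [0, 1, 0], [1, 0, 2]]  # output-side factor (3x3)
tau = [[1, 2], [0, 1], [-1, 0]]

dense = penalty_dense(tau, A_in, G_out)
factored = penalty_factored(tau, A_in, G_out)  # both evaluate to 35
```

The dense route stores a 6×6 matrix here. For a real layer the same matrix would have (in·out)² entries, whereas the factored route only ever touches an in×in and an out×out matrix.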

If this is right

Task addition and negation reach state-of-the-art accuracy without external data.
Computational cost stays constant as the number of tasks increases.
Task-vector rescaling no longer requires separate tuning on held-out data.
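The paper's appendix excerpts mention an accumulation strategy for KFAC and a bound on the error introduced when merging multiple KFAC approximations. The sketch below is not that procedure. It is a toy illustration, under the assumption that merging means averaging the two small factors separately, of why storage can stay fixed as tasks are added and why the result is only approximate. All names and values are hypothetical.

```python
def kron(A, B):
    """Kronecker product A (x) B as a dense matrix."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def mean_of(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]


# Hypothetical per-task factors for one layer (two tasks).
A1, B1 = [[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]
A2, B2 = [[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 1.0]]

# Exact average curvature block: needs every task's Kronecker product.
exact = mean_of([kron(A1, B1), kron(A2, B2)])

# Merged factors: two small matrices, no matter how many tasks contribute.
merged = kron(mean_of([A1, A2]), mean_of([B1, B2]))

# Largest entry-wise disagreement between the two.
gap = max(abs(e - m)
          for row_e, row_m in zip(exact, merged)
          for e, m in zip(row_e, row_m))
```

The gap is non-zero because an average of Kronecker products is not, in general, the Kronecker product of the averages, which is why a formal error bound is worth having.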

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same curvature-based framing could be tested on other model-merging techniques that currently rely on data to control interference.
Because the method is data-free it may be directly usable in privacy-sensitive settings where task examples cannot be stored or shared.
Replacing K-FAC with a more accurate curvature estimator could be checked to see whether disentanglement improves further.

Load-bearing premise

The curvature matrix approximated by K-FAC is a sufficient proxy for representation drift that can be regularized without any task data.

What would settle it

A direct comparison in which the K-FAC regularizer is applied to a multi-task vector combination and the resulting accuracy on a held-out test set falls below that of a data-dependent baseline or fails to stay stable under task-vector rescaling.

Figures

Figures reproduced from arXiv: 2602.17385 by Angelo Porrello, Felix Dangel, Lorenzo Bonicelli, Pietro Buzzega, Riccardo Salami, Simone Calderara, Thomas Sommariva.

**Figure 1.** Weight disentanglement (left) without and (right) with Jacobian Gram regularization. Adjacent text in the source: "… efficient curvature approximations. Specifically, we adopt Kronecker-factored approximate curvature (KFAC, Martens & Grosse, 2015), a block-diagonal approximation of the GGN, where blocks correspond to layers and each block is a Kronecker product of two small matrices. KFAC drastically reduces storage and computation whi…"

**Figure 2.** Impact of regularization on "8 Vision", CLIP ViT-B/16 (abs. accuracy).

**Figure 3.** Results for language tasks. Left: impact of different training strategies and sensitivity to the α hyperparameter. Right: effects of different regularizations on linear and non-linear fine-tuning. Adjacent text in the source: "In this regard, although our regularization is not theoretically exact in the non-linear regime, its applicability can still be justified whenever linearized behavior is implicitly enforced. For this reason, in the …"

**Figure 4.** For ViT-B/32 (8 Vision), we analyze the sensitivity of different merging strategies to the …

**Figure 5.** Distribution of $\lVert J_\theta f(x, \theta_0)\,\tau_t \rVert_2^2$ for inputs originating from the training distribution of task $t$ (inliers) versus from other tasks (outliers), under both regularized and non-regularized FT.

**Figure 6.** Analysis of the overhead of KFAC regularization during training and pre-computation.

**Figure 7.** Effect of KFAC approximation efficiency on performance. Left: impact of the number …

**Figure 8.** Applying the KFAC loss every N steps. Adjacent text in the source: "… introducing prohibitive overhead. To do so, we evaluate a regime where the penalty loss is computed and backpropagated only once every N training steps. As illustrated in …"

**Figure 9.** Visualization of weight disentanglement (Ortiz-Jimenez et al., 2023) in ViT-B/16. Non …

**Figure 10.** Impact of training and regularization choices on vision tasks (absolute accuracy). Top: …

**Figure 11.** For ViT-B/16 (8 Vision), we analyze the sensitivity of different merging strategies to …

**Figure 12.** Sensitivity to the scaling coefficient α in the non-linear fine-tuning regime. We report α-sweep results for ViT-B/32 (left) and ViT-B/16 (right), comparing standard non-linear fine-tuning, attention-only fine-tuning (Jin et al., 2025), and its variant regularized with the KFAC …

**Figure 13.** Task localization under non-linear fine-tuning. We report the distribution of the Jacobian-projected normalcy scores $\lVert J_\theta f(x, \theta_0)\,\tau_t \rVert_2^2$ for inputs belonging to task $t$ (in-task) versus inputs from all other tasks (out-of-task). The results show a consistent pattern across all datasets. Our method maintains a clear and sharp separation between in-distribution and out-of-distribution examples, closely mirr…

**Figure 14.** Task localization under linearized fine-tuning with block-compressed KFAC. The sepa…
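The score plotted in Figures 5 and 13, the squared norm of the Jacobian applied to a task vector, can be sketched on a toy model. Everything concrete here is an assumption made for illustration: the 2×2 tanh layer, the weights, and the use of central finite differences in place of the automatic differentiation a real implementation would use.

```python
import math


def model(x, theta):
    """Toy 2x2 layer: f(x, theta) = tanh(W x), with theta the row-major W."""
    W = [theta[0:2], theta[2:4]]
    return [math.tanh(W[i][0] * x[0] + W[i][1] * x[1]) for i in range(2)]


def drift_score(x, theta_0, tau, eps=1e-5):
    """Central-difference estimate of ||J_theta f(x, theta_0) tau||_2^2."""
    plus = model(x, [w + eps * d for w, d in zip(theta_0, tau)])
    minus = model(x, [w - eps * d for w, d in zip(theta_0, tau)])
    jvp = [(p - m) / (2 * eps) for p, m in zip(plus, minus)]
    return sum(v * v for v in jvp)


theta_0 = [0.3, -0.2, 0.1, 0.4]   # hypothetical pre-trained weights
tau = [0.5, 0.0, -0.5, 0.0]       # task vector touching only weights on x[0]

in_task = drift_score([1.0, 0.0], theta_0, tau)    # input the task vector acts on
out_task = drift_score([0.0, 1.0], theta_0, tau)   # input it leaves untouched
```

Here the task vector moves the output for the first input and leaves the second unchanged, a caricature of the in-task versus out-of-task separation the captions describe.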

read the original abstract

Task Arithmetic yields a modular, scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors; however, existing approaches typically require external task data, conflicting with modularity and data availability constraints (e.g., privacy requirements). We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature and obtain a practical regularizer that achieves state-of-the-art results in task addition and negation. Our method has constant complexity in the number of tasks and promotes robustness to task vector rescaling, eliminating the need for held-out tuning.
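The reframing in the abstract can be made concrete with a short derivation. It is a sketch under stated assumptions, not the paper's own derivation: the model is linearized around $\theta_0$, drift is measured as squared output change, and the symbols $\mathcal{D}$ (an input distribution), $G$ (the Jacobian Gram matrix), $A_\ell$ and $B_\ell$ (per-layer Kronecker factors), and $\tau^{(\ell)}$ (the slice of the task vector belonging to layer $\ell$) are notation introduced here.

```latex
\begin{align}
\mathbb{E}_{x \sim \mathcal{D}}
  \bigl\| f_{\mathrm{lin}}(x;\theta_0 + \tau) - f(x;\theta_0) \bigr\|_2^2
  &= \mathbb{E}_{x \sim \mathcal{D}}
     \bigl\| J_\theta f(x,\theta_0)\,\tau \bigr\|_2^2 \\
  &= \tau^\top
     \underbrace{\mathbb{E}_{x \sim \mathcal{D}}
       \bigl[ J_\theta f(x,\theta_0)^\top J_\theta f(x,\theta_0) \bigr]}_{G}
     \,\tau \\
  &\approx \sum_{\ell}
     \operatorname{vec}\bigl(\tau^{(\ell)}\bigr)^\top
     \bigl( A_\ell \otimes B_\ell \bigr)
     \operatorname{vec}\bigl(\tau^{(\ell)}\bigr)
\end{align}
```

The first line uses $f_{\mathrm{lin}}(x;\theta_0+\tau) = f(x;\theta_0) + J_\theta f(x,\theta_0)\,\tau$. The second moves the expectation inside the quadratic form, so all dependence on the inputs is absorbed into $G$. The third replaces $G$ with a block-diagonal, Kronecker-factored approximation. Once the factors are stored, the penalty can be evaluated without samples from $\mathcal{D}$.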

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a dataless K-FAC regularizer for task vectors that claims constant complexity and SOTA results, but the core proxy assumption still needs direct checks.


This paper's main point is that you can regularize task vectors against representation drift without any task data by treating the problem as a curvature-matrix approximation and plugging in K-FAC. That removes the usual need for held-out data, along with the privacy issues it raises, when you want to add or subtract vectors from a base model. They report state-of-the-art numbers on addition and negation while keeping the cost independent of how many tasks you combine and making the method less sensitive to vector scaling, so no extra tuning step is required.

The framing is a direct but useful combination of task arithmetic ideas with an established curvature approximation, and the constant-complexity claim plus the robustness to rescaling are the practical wins that stand out. The experiments appear to back the performance claims on the benchmarks they chose.

The soft spot is the central assumption that K-FAC factors computed on the base model (or a small calibration set) will stay a faithful enough stand-in for the actual drift that occurs once you apply the task vectors. If the approximation error grows with vector magnitude or model depth, the regularizer could either under-correct interference or add unhelpful constraints that hurt single-task accuracy. The abstract does not spell out controls for that, so the results rest on how well the proxy holds in the full runs.

This work is aimed at people doing modular editing or merging of foundation models, especially in data-scarce or privacy-sensitive settings. A reader already working with task vectors would pick up a concrete implementation trick and some empirical evidence that the approach can outperform earlier regularizers. It is solid enough on the technical side to deserve a serious referee, mainly so the approximation quality and the experimental controls can be examined in detail.

Referee Report

1 major / 2 minor

Summary. The paper proposes a dataless regularization method for task arithmetic in foundation models to mitigate cross-task interference and representation drift. By framing the regularization as a curvature-matrix approximation problem and adopting Kronecker-Factored Approximate Curvature (K-FAC) computed on the base model, the authors derive a practical regularizer claimed to achieve state-of-the-art performance on task addition and negation. The approach is asserted to have constant complexity in the number of tasks, promote robustness to task-vector rescaling, and eliminate the need for held-out tuning data.

Significance. If the central empirical claims hold and the K-FAC proxy is validated as a faithful surrogate for representation drift without task data, the work would meaningfully advance modular adaptation of foundation models under privacy or data-availability constraints. It usefully repurposes an established second-order approximation technique to remove a key practical barrier in task arithmetic.

major comments (1)

The load-bearing premise that K-FAC factors computed solely on the base model (or a small calibration set) remain a sufficient proxy for the actual change in hidden activations induced by task vectors is not accompanied by direct validation of approximation error as a function of task-vector magnitude or network depth. Without such controls, it is unclear whether the regularizer under-penalizes interference or introduces spurious constraints that could degrade single-task performance.

minor comments (2)

Clarify in the methods section exactly how the K-FAC factors are computed (e.g., on which layers and with what damping) and whether any small calibration set is used.
Add error bars and explicit baseline details (including hyperparameter tuning budgets) to all reported tables so that the SOTA claim can be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of our dataless regularization approach. We address the major comment below and have incorporated clarifications and additional discussion into the revised manuscript.


Referee: The load-bearing premise that K-FAC factors computed solely on the base model (or a small calibration set) remain a sufficient proxy for the actual change in hidden activations induced by task vectors is not accompanied by direct validation of approximation error as a function of task-vector magnitude or network depth. Without such controls, it is unclear whether the regularizer under-penalizes interference or introduces spurious constraints that could degrade single-task performance.

Authors: We agree that explicit controls on approximation error would provide additional reassurance. Our approach uses K-FAC factors from the base model to approximate the local curvature and thereby penalize representation drift without task data. While the original manuscript does not include direct plots of approximation error versus task-vector magnitude or depth, the reported robustness to rescaling (across multiple scales in our experiments) and the consistent preservation of single-task accuracy indicate that the proxy does not systematically under-penalize interference or impose harmful constraints. In the revision we have expanded Section 3.2 with a brief theoretical motivation for employing base-model K-FAC and a short discussion of its expected limitations as a first-order surrogate.

Revision: partial

Circularity Check

0 steps flagged

No circularity: derivation adopts external K-FAC curvature approximation as independent proxy


The paper frames regularization against representation drift as a curvature-matrix approximation problem and directly adopts the established Kronecker-Factored Approximate Curvature (K-FAC) technique. This step relies on prior external literature rather than self-citation chains, fitted parameters renamed as predictions, or self-definitional loops. The central claim of a dataless regularizer with constant task complexity is constructed from the external approximation and does not reduce to its own inputs by construction. No load-bearing self-citations or ansatzes smuggled via prior author work are present in the provided derivation outline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that K-FAC provides a usable proxy for representation drift without task data; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Kronecker-Factored Approximate Curvature is a sufficient approximation for regularizing representation drift in the absence of task data.
This premise is invoked when the authors frame regularization against representation drift as a curvature matrix approximation problem.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem... adopt Kronecker-Factored Approximate Curvature"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero · unclear
Paper passage: "the Jacobian Gram matrix is an instance of the generalized Gauss-Newton (GGN) matrix"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1]

Deep learning with differential privacy

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security,

work page 2016
[2]

Practical secure aggregation for privacy-preserving machine learning

Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security,

work page 2017
[3]

A large annotated corpus for learning natural language inference

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2015
[4]

Remote sensing image scene classification: Benchmark and state of the art

Published as a conference paper at ICLR 2026
Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE,

work page 2026
[5]

Neural tangent kernel: Convergence and generalization in neural networks

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems,

work page 2026
[6]

Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014),

work page 2014
[7]

Communication-efficient learning of deep networks from decentralized data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics,

work page 2026
[8]

The german traffic sign recognition benchmark: a multi-class classification competition

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German traffic sign recognition benchmark: a multi-class classification competition. In The 2011 International Joint Conference on Neural Networks,

work page 2011
[9]

Parameter efficient multi-task model fusion with partial linearization

Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, and Dacheng Tao. Parameter efficient multi-task model fusion with partial linearization. In International Conference on Learning Representations,

work page 2026
[10]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,

work page 2018
[11]

B discusses the main limitations of our approach, including memory requirements and curvature-estimation challenges

A Appendix / Supplementary Material
The appendix is organized as follows:
• Sec. B discusses the main limitations of our approach, including memory requirements and curvature-estimation challenges.
• Sec. C provides a derivation and a formal bound on the approximation error introduced when merging multiple KFA...

work page 2026
[12]

should be explored. While we combine the well-established KFAC with an accumulation strategy, designing curvature approximations that can easily be merged without sacrificing accuracy may be worth exploring in the future. Moreover, our experiments in the text domain indicate room for improvement, raising the question of whether more sophisticated techni...

work page 2026
[13]

Non linear fine-tuning Ilharco et al

in ViT-B/16. Non-linear fine-tuning (Ilharco et al., 2022), linear fine-tuning (Ortiz-Jimenez et al., 2023), attention-only fine-tuning (Jin et al., 2025), linear fine-tuning with KFAC regularization. where $\mathrm{dist}(y_1, y_2) = \mathbb{1}(y_1 \neq y_2)$. When $\xi(\alpha_1, \alpha_2) = 0$, tasks $\tau_1$ and $\tau_2$ merge without interference for the corresponding values of $\alpha_1$ and $\alpha$…

work page 2022
[14]

This property is further enhanced under our regularization regime, where only a few darker regions remain, mostly for α > 1, a setting that is never used in practice

As shown in the plots, linearized fine-tuning substantially improves the disentanglement of task vectors. This property is further enhanced under our regularization regime, where only a few darker regions remain, mostly for α > 1, a setting that is never used in practice. Notably, in our experiments the disentanglement error is consistently close to zero al...

work page 2026
[15]

For training the task vectors, we followed the setup of previous works Ilharco et al

and conduct experiments on Stanford Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun et al., 2002), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016), and SVHN (Netzer et al., 2011). For training the task vectors, we followed the setup of previous works Ilharco et ...

work page 2013
[16]

Additionally, SciTail (Khot et al., 2018), RTE (Wang et al., 2018), and QNLI (Wang et al.,

which are three-way classification tasks where the relation between a premise and a hypothesis must be identified as entailment, contradiction, or neutral. Additionally, SciTail (Khot et al., 2018), RTE (Wang et al., 2018), and QNLI (Wang et al.,

work page 2018
[17]

Results are reported for α = 1.0 and the best-performing α

are binary entailment tasks, and therefore fine-tuning …
Table 4: Comparison of different merging strategies in the linear fine-tuning regime, with and without KFAC regularization. Results are reported for α = 1.0 and the best-performing α.
Columns: Method · α · ViT-B/32 (Abs., Norm.) · ViT-B/16 (Abs., Norm.). First row: Linear FT + TIES Yadav e...

work page 2026
[18]

Consistent with prior work Ortiz-Jimenez et al

Interestingly, the larger ViT-L/14 backbone exhibits smaller relative gains from regularization, particularly in the non-linear regime, where its behavior closely resembles that of its linearized counterpart. Consistent with prior work (Ortiz-Jimenez et al., 2023), this suggests that very large models may already display an implicit form of regularizatio...

work page 2023
[19]

ViT-B/32
λ        Seed 7   Seed 21   Seed 42   Avg.
0        75.0     75.4      75.1      75.2 ± 0.028
1        82.2     82.4      80.6      81.7 ± 0.648
10       85.2     85.1      85.1      85.1 ± 0.002
100      86.2     85.8      86.0      86.0 ± 0.026
1000     86.5     86.4      86.4      86.4 ± 0.002
10000    84.5     84.4      84.3      84.4 ± 0.006

ViT-B/16
λ        Seed 7   Seed 21   Seed 42   Avg.
0        79.1     78.7      79.1      79.0 ± 0.188
1        83.2     83.4      83.8      83.5 ± 0.265
50       86.9     86.8      87.0      86.9 ± 0.059
500      88.0     87.9      88.2      88.0 ± 0.11...

work page 2025
[20]

Results show that a universal, task-agnostic KFAC (ImageNet-KFAC) retains most of the benefits of our regularizer while requiring no access to auxiliary task-specific data

This behavior confirms that successful model merging primarily depends on the presence of regularization based on information from the generalized Gauss-Newton matrix, …
Table 6: Task addition results on the eight vision datasets when using either task-specific KFAC factors or a single shared KFAC computed on ...

work page 2026
[21]

into 5 tasks of 2 classes each. After fine-tuning the base model on each partition, the resulting models are merged and evaluated on the full test set, considering the union of all classes across tasks rather than restricting evaluation to the classes of the training task only, as done in the 8 Vision benchmark. Accuracy is then reported on this joint classif...

work page 2025
[22]

8, our method exhibits consistently strong performance in the text domain, mirroring the trends observed in the vision setting

As reported in Tab. 8, our method exhibits consistently strong performance in the text domain, mirroring the trends observed in the vision setting.

G Related Works on Linearized Fine-Tuning
Linearized models offer a principled lens for analyzing fine-tuning by considering first-order expansions around a pre-trained initialization. Foundational work (Arora...

work page 2019
[23]

showed that infinitely wide networks trained with gradient descent follow kernel gradient flow under the Neural Tangent Kernel (NTK), yielding exact functional characterizations of training dynamics. This perspective has since been extended to more realistic settings, including representation learning (Mu et al., 2020), small-data regimes (Arora et al.,...

work page 2020

[1] [1]

Deep learning with differential privacy

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC conference on computer and communications security,

work page 2016

[2] [2]

Practical secure aggregation for privacy- preserving machine learning

Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy- preserving machine learning. Inproceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security,

work page 2017

[3] [3]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large anno- tated corpus for learning natural language inference. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2015

[4] [4]

Remote sensing image scene classification: Bench- mark and state of the art.Proceedings of the IEEE,

11 Published as a conference paper at ICLR 2026 Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Bench- mark and state of the art.Proceedings of the IEEE,

work page 2026

[5] [5]

Neural tangent kernel: Convergence and gen- eralization in neural networks.Advances in Neural Information Processing Systems,

12 Published as a conference paper at ICLR 2026 Arthur Jacot, Franck Gabriel, and Cl ´ement Hongler. Neural tangent kernel: Convergence and gen- eralization in neural networks.Advances in Neural Information Processing Systems,

work page 2026

[6] [6]

Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. InProceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014),

work page 2014

[7] [7]

Communication-efficient learning of deep networks from decentralized data

13 Published as a conference paper at ICLR 2026 Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. InInternational Conference on Artificial Intelligence and Statistics,

work page 2026

[8] [8]

The german traffic sign recognition benchmark: a multi-class classification competition

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: a multi-class classification competition. InThe 2011 international joint conference on neural networks,

work page 2011

[9] [9]

Parameter efficient multi-task model fusion with partial linearization

14 Published as a conference paper at ICLR 2026 Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, and Dacheng Tao. Parameter efficient multi-task model fusion with partial linearization. InInternational Conference on Learning Representations,

work page 2026

[10] [10]

A broad-coverage challenge corpus for sen- tence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sen- tence understanding through inference. InProceedings of the 2018 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies,

work page 2018

[11] [11]

B discusses the main limitations of our approach, including memory requirements and curvature-estimation challenges

15 Published as a conference paper at ICLR 2026 A APPENDIX/ SUPPLEMENTARYMATERIAL The appendix is organized as follows: • Sec. B discusses the main limitations of our approach, including memory requirements and curvature-estimation challenges. • Sec. C provides a derivation and a formal bound on the approximation error introduced when merging multiple KFA...

work page 2026

[12] [12]

should be explored. While we combine the well-established KFAC with an accumulation strategy, designing curvature approximations that can easily be merged without sacri- ficing accuracy may be worth exploring in the future. Moreover, our experiments in the text domain indicate room for improvement, raising the question of whether more sophisticated techni...

work page 2026

[13] [13]

Non linear fine-tuning Ilharco et al

in ViT-B/16. Non linear fine-tuning Ilharco et al. (2022), Linear fine-tuning Ortiz-Jimenez et al. (2023), Attention- Only fine-tuning Jin et al. (2025), Linear fine-tuning with KFAC regularization. wheredist(y 1, y2) =1(y 1 ̸=y 2). Whenξ(α 1, α2) = 0, tasksτ 1 andτ 2 merge without interference for the corresponding values ofα 1 andα

work page 2022

[14] [14]

This property is further enhanced under our regularization regime, where only a few darker regions remain, mostly forα >1, a setting that is never used in practice

As shown in the plots, linearized fine-tuning substantially improves the disentanglement of task vectors. This property is further enhanced under our regularization regime, where only a few darker regions remain, mostly forα >1, a setting that is never used in practice. Notably, in our experiments the disentanglement error is consistently close to zero al...

work page 2026
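As a rough illustration of the disentanglement error discussed in this entry, the sketch below evaluates $\xi(\alpha_1, \alpha_2)$ on a toy problem. It assumes the definition of Ortiz-Jimenez et al. (2023): for each task, the predictions of the merged model $\theta_0 + \alpha_1 \tau_1 + \alpha_2 \tau_2$ are compared with those of the single-task model $\theta_0 + \alpha_t \tau_t$ on that task's inputs, using $\mathrm{dist}(y_1, y_2) = \mathbb{1}(y_1 \neq y_2)$. The linear classifier, the helper names (`predict`, `edit`, `disentanglement_error`), and the data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def predict(theta: Vector, x: Sequence[float]) -> int:
    """Toy binary classifier (hypothetical): 1 if <theta, x> > 0, else 0."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1 if score > 0 else 0


def edit(theta0: Vector, taus: Sequence[Vector], alphas: Sequence[float]) -> Vector:
    """Task arithmetic: theta0 + sum_t alpha_t * tau_t."""
    theta = list(theta0)
    for tau, alpha in zip(taus, alphas):
        theta = [p + alpha * d for p, d in zip(theta, tau)]
    return theta


def disentanglement_error(
    theta0: Vector,
    taus: Sequence[Vector],
    alphas: Sequence[float],
    task_inputs: Sequence[Sequence[Sequence[float]]],
) -> float:
    """Sum over tasks of the fraction of inputs on which the merged model
    disagrees with the corresponding single-task model."""
    merged = edit(theta0, taus, alphas)
    error = 0.0
    for tau, alpha, inputs in zip(taus, alphas, task_inputs):
        single = edit(theta0, [tau], [alpha])
        mismatches = sum(predict(single, x) != predict(merged, x) for x in inputs)
        error += mismatches / len(inputs)
    return error


# Toy setup: task 1 inputs use dimensions 0-1, task 2 inputs use dimensions 2-3.
theta0 = [0.0, 0.0, 0.0, 0.0]
inputs_1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
inputs_2 = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
tau_1 = [1.0, -1.0, 0.0, 0.0]
tau_2_disjoint = [0.0, 0.0, -1.0, 1.0]    # touches only task 2's dimensions
tau_2_entangled = [-2.0, 2.0, -1.0, 1.0]  # also overwrites task 1's dimensions

xi_disjoint = disentanglement_error(
    theta0, [tau_1, tau_2_disjoint], [1.0, 1.0], [inputs_1, inputs_2]
)
xi_entangled = disentanglement_error(
    theta0, [tau_1, tau_2_entangled], [1.0, 1.0], [inputs_1, inputs_2]
)
```

In this toy case the error vanishes when the two task vectors act on disjoint input dimensions and becomes positive once the second task vector overwrites the first task's dimensions, which is the kind of interference the heatmaps visualize over a grid of $(\alpha_1, \alpha_2)$ values.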

[15] [15]

For training the task vectors, we followed the setup of previous works Ilharco et al

and conduct experiments on Stanford Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun et al., 2002), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016), and SVHN (Netzer et al., 2011). For training the task vectors, we followed the setup of previous works Ilharco et ...

work page 2013

[16] [16]

Additionally, SciTail (Khot et al., 2018), RTE (Wang et al., 2018), and QNLI (Wang et al.,

which are three-way classification tasks where the relation between a premise and a hypothesis must be identified as entailment, contradiction, or neutral. Additionally, SciTail (Khot et al., 2018), RTE (Wang et al., 2018), and QNLI (Wang et al.,

work page 2018

[17] [17]

Results are reported for $\alpha = 1.0$ and the best-performing $\alpha$

are binary entailment tasks, and therefore fine-tuning
Table 4: Comparison of different merging strategies in the linear fine-tuning regime, with and without KFAC regularization. Results are reported for $\alpha = 1.0$ and the best-performing $\alpha$.
Columns: Method, $\alpha$, ViT-B/32 (Abs., Norm.), ViT-B/16 (Abs., Norm.)
Linear FT + TIES Yadav e...

work page 2026

[18] [18]

Consistent with prior work Ortiz-Jimenez et al

Interestingly, the larger ViT-L/14 backbone exhibits smaller relative gains from regularization, particularly in the non-linear regime, where its behavior closely resembles that of its linearized counterpart. Consistent with prior work Ortiz-Jimenez et al. (2023), this suggests that very large models may already display an implicit form of regularizatio...

work page 2023

[19] [19]

ViT-B/32
λ        Seed 7   Seed 21   Seed 42   AVG.
0        75.0     75.4      75.1      75.2±0.028
1        82.2     82.4      80.6      81.7±0.648
10       85.2     85.1      85.1      85.1±0.002
100      86.2     85.8      86.0      86.0±0.026
1000     86.5     86.4      86.4      86.4±0.002
10000    84.5     84.4      84.3      84.4±0.006

ViT-B/16
λ        Seed 7   Seed 21   Seed 42   AVG.
0        79.1     78.7      79.1      79.0±0.188
1        83.2     83.4      83.8      83.5±0.265
50       86.9     86.8      87.0      86.9±0.059
500      88.0     87.9      88.2      88.0±0.11...

work page 2025

[20] [20]

Results show that a universal, task-agnostic KFAC (ImageNet-KFAC) retains most of the benefits of our regularizer while requiring no access to auxiliary task-specific data

This behavior confirms that successful model merging primarily depends on the presence of regularization based on information from the generalized Gauss-Newton matrix,
Table 6: Task addition results on the eight vision datasets when using either task-specific KFAC factors or a single shared KFAC computed on ...

work page 2026
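To make the role of the Kronecker factors more concrete, the sketch below evaluates a quadratic curvature penalty on a single layer's weight update without ever forming the full curvature matrix. It relies on the standard identity $\mathrm{vec}(\Delta W)^\top (A \otimes G)\, \mathrm{vec}(\Delta W) = \mathrm{tr}(\Delta W^\top G\, \Delta W A)$ for symmetric $A$ and column-major $\mathrm{vec}$. The function names, the shapes (`delta_w` is out × in, `g` is out × out, `a` is in × in), and the numbers are illustrative assumptions; this is not the paper's exact objective or implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def kfac_penalty(delta_w: Matrix, g: Matrix, a: Matrix) -> float:
    """tr(dW^T G dW A), computed with two small matrix products.
    Assumes `a` is symmetric, as a covariance-style factor would be."""
    gwa = matmul(matmul(g, delta_w), a)  # G dW A, shape out x in
    return sum(
        delta_w[i][j] * gwa[i][j]
        for i in range(len(delta_w))
        for j in range(len(delta_w[0]))
    )


def dense_penalty(delta_w: Matrix, g: Matrix, a: Matrix) -> float:
    """Reference value: builds the full Kronecker product A (x) G explicitly."""
    out_dim, in_dim = len(delta_w), len(delta_w[0])
    # Column-major vec: entry j * out_dim + i holds dW[i][j].
    vec = [delta_w[i][j] for j in range(in_dim) for i in range(out_dim)]
    kron = [
        [a[j][l] * g[i][k] for l in range(in_dim) for k in range(out_dim)]
        for j in range(in_dim)
        for i in range(out_dim)
    ]
    n = len(vec)
    return sum(vec[r] * kron[r][c] * vec[c] for r in range(n) for c in range(n))


# Hypothetical factors for a layer with 3 inputs and 2 outputs.
g_factor = [[2.0, 1.0], [1.0, 3.0]]
a_factor = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 0.0, 3.0]]
delta_w = [[1.0, -2.0, 0.5], [0.0, 1.0, -1.0]]

factored = kfac_penalty(delta_w, g_factor, a_factor)
reference = dense_penalty(delta_w, g_factor, a_factor)
```

The factored form only needs the two small factors, whereas the dense reference stores a matrix whose side length is the product of the layer's input and output dimensions, which is why the explicit form is used here only as a check.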

[21] [21]

into 5 tasks of 2 classes each. After fine-tuning the base model on each partition, the resulting models are merged and evaluated on the full test set, considering the union of all classes across tasks rather than restricting evaluation to the classes of the training task only, as done in the 8 Vision benchmark. Accuracy is then reported on this joint classif...

work page 2025
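The partitioning protocol described in this entry can be sketched as follows. The example assumes 10 classes split contiguously into 5 tasks of 2 classes each; the contiguous ordering, the total class count, and the helper names (`split_classes`, `joint_accuracy`) are assumptions made for illustration, since the excerpt does not specify them.

```python
from typing import List, Sequence


def split_classes(num_classes: int, classes_per_task: int) -> List[List[int]]:
    """Partition class ids 0..num_classes-1 into contiguous, disjoint tasks."""
    if num_classes % classes_per_task != 0:
        raise ValueError("num_classes must be divisible by classes_per_task")
    return [
        list(range(start, start + classes_per_task))
        for start in range(0, num_classes, classes_per_task)
    ]


def joint_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Accuracy over the union of all classes: a prediction counts as correct
    only if it matches the label among every class, not just the classes of
    the task the sample came from."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


tasks = split_classes(num_classes=10, classes_per_task=2)
example_accuracy = joint_accuracy(predictions=[0, 3, 9, 1], labels=[0, 3, 8, 1])
```

Evaluating over the union of classes is stricter than per-task evaluation, because the merged model must also avoid confusing classes that belong to different tasks.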

[22] [22]

8, our method exhibits consistently strong performance in the text domain, mirroring the trends observed in the vision setting

As reported in Tab. 8, our method exhibits consistently strong performance in the text domain, mirroring the trends observed in the vision setting.
G RELATED WORKS ON LINEARIZED FINE-TUNING
Linearized models offer a principled lens for analyzing fine-tuning by considering first-order expansions around a pre-trained initialization. Foundational work (Arora...

work page 2019

[23] [23]

showed that infinitely wide networks trained with gradient descent follow kernel gradient flow under the Neural Tangent Kernel (NTK), yielding exact functional characterizations of training dynamics. This perspective has since been extended to more realistic settings, including representation learning (Mu et al., 2020), small-data regimes (Arora et al.,...

work page 2020