Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature
Pith reviewed 2026-05-22 11:03 UTC · model grok-4.3
The pith
Task vectors can be disentangled without any task data by using a Kronecker-factored curvature approximation as a regularizer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By casting representation-drift regularization as a curvature-matrix approximation task, Kronecker-Factored Approximate Curvature supplies a practical, dataless regularizer for task vectors that preserves performance under addition and negation.
What carries the argument
Kronecker-Factored Approximate Curvature applied as a regularizer that penalizes deviation from the original curvature structure when task vectors are combined.
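As a rough illustration of what such a penalty could look like for a single linear layer, the sketch below evaluates a Kronecker-factored quadratic form on a weight update. The function name `kfac_penalty`, the shapes, and the random factors are hypothetical stand-ins; the summary does not specify the paper's exact regularizer, so this only shows the standard identity vec(dW)^T (A kron G) vec(dW) = tr(G dW A dW^T).

```python
import numpy as np

def kfac_penalty(delta_w, a_factor, g_factor):
    """Quadratic form vec(dW)^T (A kron G) vec(dW), computed as tr(G dW A dW^T).

    delta_w: (out, in) weight update (a stand-in for one layer of a task vector).
    a_factor: (in, in) symmetric PSD input second-moment factor.
    g_factor: (out, out) symmetric PSD output-side factor.
    """
    return float(np.trace(g_factor @ delta_w @ a_factor @ delta_w.T))

def random_psd(rng, n):
    # Hypothetical factor: any symmetric PSD matrix works for the identity.
    m = rng.standard_normal((n, n))
    return m @ m.T / n

rng = np.random.default_rng(0)
out_dim, in_dim = 3, 4
a_factor = random_psd(rng, in_dim)
g_factor = random_psd(rng, out_dim)
delta_w = rng.standard_normal((out_dim, in_dim))

# Cross-check against the explicit Kronecker product (column-major vec).
vec = delta_w.reshape(-1, order="F")
explicit = float(vec @ np.kron(a_factor, g_factor) @ vec)
penalty = kfac_penalty(delta_w, a_factor, g_factor)
print(penalty, explicit)
```

The trace form avoids materializing the (out·in) × (out·in) Kronecker product, which is the usual reason for working with the two small factors directly.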
If this is right
- Task addition and negation reach state-of-the-art accuracy without external data.
- Computational cost stays constant as the number of tasks increases.
- Task-vector rescaling no longer requires separate tuning on held-out data.
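One way constant cost in the number of tasks could arise is by accumulating curvature factors into a single pair instead of storing one pair per task. The sketch below is an assumption for illustration only (the averaging scheme, names, and random factors are hypothetical, not taken from the paper); it also computes the per-task average penalty so the two quantities can be compared, since a Kronecker product of averaged factors generally differs from an average of Kronecker products.

```python
import numpy as np

def random_psd(rng, n):
    # Hypothetical per-task factor: a random symmetric PSD matrix.
    m = rng.standard_normal((n, n))
    return m @ m.T / n

rng = np.random.default_rng(0)
num_tasks, in_dim, out_dim = 5, 4, 3
factors = [(random_psd(rng, in_dim), random_psd(rng, out_dim))
           for _ in range(num_tasks)]

# Accumulated averages: storage is one (A, G) pair regardless of num_tasks.
a_acc = np.zeros((in_dim, in_dim))
g_acc = np.zeros((out_dim, out_dim))
for a_t, g_t in factors:
    a_acc += a_t / num_tasks
    g_acc += g_t / num_tasks

delta_w = rng.standard_normal((out_dim, in_dim))

# Average of per-task penalties (needs every task's factors).
exact = sum(np.trace(g @ delta_w @ a @ delta_w.T) for a, g in factors) / num_tasks
# Penalty from the merged pair (needs only the accumulated factors).
merged = np.trace(g_acc @ delta_w @ a_acc @ delta_w.T)
print(exact, merged)
```

The gap between `exact` and `merged` is the kind of merging error that would need to be bounded for such an accumulation to be trustworthy.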
Where Pith is reading between the lines
- The same curvature-based framing could be tested on other model-merging techniques that currently rely on data to control interference.
- Because the method is data-free it may be directly usable in privacy-sensitive settings where task examples cannot be stored or shared.
- Replacing K-FAC with a more accurate curvature estimator could be checked to see whether disentanglement improves further.
Load-bearing premise
The curvature matrix approximated by K-FAC is a sufficient proxy for representation drift that can be regularized without any task data.
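The chain of approximations behind this premise can be written out as follows. This is a sketch under assumed notation, not the paper's derivation: $f(x;\theta)$ is the network output, $\theta_0$ the base weights, $\tau$ a task vector, $J(x)$ the Jacobian of $f$ with respect to the weights at $\theta_0$, and $A_\ell$, $G_\ell$ the usual K-FAC input and output-side factors of layer $\ell$ with weight update $\Delta W_\ell$ (column-stacking vec, symmetric $A_\ell$).

```latex
% Assumed notation, see the paragraph above.
\begin{align}
f(x;\theta_0+\tau) - f(x;\theta_0) &\approx J(x)\,\tau, \\
\mathbb{E}_x \big\| f(x;\theta_0+\tau) - f(x;\theta_0) \big\|^2
  &\approx \tau^\top \,\mathbb{E}_x\!\big[ J(x)^\top J(x) \big]\, \tau, \\
\mathbb{E}_x\!\big[ J_\ell(x)^\top J_\ell(x) \big] &\approx A_\ell \otimes G_\ell, \\
\operatorname{vec}(\Delta W_\ell)^\top (A_\ell \otimes G_\ell)\, \operatorname{vec}(\Delta W_\ell)
  &= \operatorname{tr}\!\big( G_\ell\, \Delta W_\ell\, A_\ell\, \Delta W_\ell^\top \big).
\end{align}
```

The first line is a first-order expansion, the second identifies the Jacobian Gram matrix (the generalized Gauss-Newton matrix for a squared-error loss), the third is the block-diagonal Kronecker approximation, and only the last line is an exact identity. The premise fails wherever any of the first three steps is loose.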
What would settle it
A direct comparison in which the K-FAC regularizer is applied to a multi-task vector combination and the resulting accuracy on a held-out test set falls below that of a data-dependent baseline or fails to stay stable under task-vector rescaling.
Original abstract
Task Arithmetic yields a modular, scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors; however, existing approaches typically require external task data, conflicting with modularity and data availability constraints (e.g., privacy requirements). We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature and obtain a practical regularizer that achieves state-of-the-art results in task addition and negation. Our method has constant complexity in the number of tasks and promotes robustness to task vector rescaling, eliminating the need for held-out tuning.
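For readers new to task arithmetic, the operations the abstract refers to can be sketched in a few lines. The helper names and the toy two-parameter "model" are hypothetical; the sketch assumes the usual definition of a task vector as fine-tuned weights minus base weights, with addition and negation applied through scaling coefficients.

```python
import numpy as np

def task_vector(theta_ft, theta_0):
    """Task vector: fine-tuned weights minus base weights, per parameter."""
    return {k: theta_ft[k] - theta_0[k] for k in theta_0}

def apply_task_vectors(theta_0, vectors, alphas):
    """Return theta_0 + sum_t alpha_t * tau_t without modifying theta_0."""
    merged = {k: v.copy() for k, v in theta_0.items()}
    for tau, alpha in zip(vectors, alphas):
        for k in merged:
            merged[k] += alpha * tau[k]
    return merged

# Toy weights standing in for a foundation model and two fine-tuned copies.
theta_0 = {"w": np.array([1.0, 2.0])}
theta_a = {"w": np.array([1.5, 2.0])}
theta_b = {"w": np.array([1.0, 3.0])}

tau_a = task_vector(theta_a, theta_0)
tau_b = task_vector(theta_b, theta_0)

added = apply_task_vectors(theta_0, [tau_a, tau_b], [1.0, 1.0])  # task addition
negated = apply_task_vectors(theta_0, [tau_a], [-1.0])           # task negation
print(added["w"], negated["w"])
```

In this toy case the two task vectors touch different coordinates, so they combine without interference; the paper's concern is the realistic case where they do not.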
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a dataless regularization method for task arithmetic in foundation models to mitigate cross-task interference and representation drift. By framing the regularization as a curvature-matrix approximation problem and adopting Kronecker-Factored Approximate Curvature (K-FAC) computed on the base model, the authors derive a practical regularizer claimed to achieve state-of-the-art performance on task addition and negation. The approach is asserted to have constant complexity in the number of tasks, promote robustness to task-vector rescaling, and eliminate the need for held-out tuning data.
Significance. If the central empirical claims hold and the K-FAC proxy is validated as a faithful surrogate for representation drift without task data, the work would meaningfully advance modular adaptation of foundation models under privacy or data-availability constraints. It usefully repurposes an established second-order approximation technique to remove a key practical barrier in task arithmetic.
major comments (1)
- The load-bearing premise that K-FAC factors computed solely on the base model (or a small calibration set) remain a sufficient proxy for the actual change in hidden activations induced by task vectors is not accompanied by direct validation of approximation error as a function of task-vector magnitude or network depth. Without such controls, it is unclear whether the regularizer under-penalizes interference or introduces spurious constraints that could degrade single-task performance.
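The kind of control requested here could be prototyped on a toy model before running it on a real backbone. The sketch below is an assumption-laden illustration, not the paper's experiment: it uses a hypothetical two-layer tanh network, a random direction as a stand-in task vector on the first layer, and compares the actual mean squared output drift with a linearized (Gauss-Newton) prediction and a K-FAC-style prediction at several task-vector scales.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 512, 8, 16, 4
x = rng.standard_normal((n, d_in))
w1 = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)
w2 = rng.standard_normal((d_out, d_hid)) / np.sqrt(d_hid)
direction = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)  # stand-in task vector

def forward(w1_):
    return np.tanh(x @ w1_.T) @ w2.T  # (n, d_out)

# Factors at the base weights.
gate = 1.0 - np.tanh(x @ w1.T) ** 2            # tanh derivative, (n, d_hid)
a_factor = x.T @ x / n                         # input second moment
# Mean of J^T J with J = W2 diag(gate_i): (W2^T W2) elementwise mean(gate gate^T).
g_factor = (w2.T @ w2) * (gate.T @ gate / n)

def drift_report(scale):
    dw = scale * direction
    actual = np.mean(np.sum((forward(w1 + dw) - forward(w1)) ** 2, axis=1))
    lin = (gate * (x @ dw.T)) @ w2.T           # first-order output change per sample
    ggn = np.mean(np.sum(lin ** 2, axis=1))
    kfac = np.trace(g_factor @ dw @ a_factor @ dw.T)
    return actual, ggn, kfac

for scale in (0.01, 0.1, 1.0):
    actual, ggn, kfac = drift_report(scale)
    print(scale, abs(ggn - actual) / actual, abs(kfac - actual) / actual)
```

The two relative errors separate the effect of linearization from the effect of the Kronecker factorization; repeating the sweep per layer would give the depth dependence the comment asks about.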
minor comments (2)
- Clarify in the methods section exactly how the K-FAC factors are computed (e.g., on which layers and with what damping) and whether any small calibration set is used.
- Add error bars and explicit baseline details (including hyperparameter tuning budgets) to all reported tables so that the SOTA claim can be assessed.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of our dataless regularization approach. We address the major comment below and have incorporated clarifications and additional discussion into the revised manuscript.
Point-by-point responses
- Referee: The load-bearing premise that K-FAC factors computed solely on the base model (or a small calibration set) remain a sufficient proxy for the actual change in hidden activations induced by task vectors is not accompanied by direct validation of approximation error as a function of task-vector magnitude or network depth. Without such controls, it is unclear whether the regularizer under-penalizes interference or introduces spurious constraints that could degrade single-task performance.
  Authors: We agree that explicit controls on approximation error would provide additional reassurance. Our approach uses K-FAC factors from the base model to approximate the local curvature and thereby penalize representation drift without task data. While the original manuscript does not include direct plots of approximation error versus task-vector magnitude or depth, the reported robustness to rescaling (across multiple scales in our experiments) and the consistent preservation of single-task accuracy indicate that the proxy does not systematically under-penalize interference or impose harmful constraints. In the revision we have expanded Section 3.2 with a brief theoretical motivation for employing base-model K-FAC and a short discussion of its expected limitations as a first-order surrogate.
  Revision: partial
Circularity Check
No circularity: derivation adopts external K-FAC curvature approximation as independent proxy
Full rationale
The paper frames regularization against representation drift as a curvature-matrix approximation problem and directly adopts the established Kronecker-Factored Approximate Curvature (K-FAC) technique. This step relies on prior external literature rather than self-citation chains, fitted parameters renamed as predictions, or self-definitional loops. The central claim of a dataless regularizer with constant task complexity is constructed from the external approximation and does not reduce to its own inputs by construction. No load-bearing self-citations or ansatzes smuggled via prior author work are present in the provided derivation outline.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Kronecker-Factored Approximate Curvature is a sufficient approximation for regularizing representation drift in the absence of task data.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem... adopt Kronecker-Factored Approximate Curvature"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: costAlphaLog_fourth_deriv_at_zero
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "the Jacobian Gram matrix is an instance of the generalized Gauss-Newton (GGN) matrix"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.
- [2] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.
- [3] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
Published as a conference paper at ICLR 2026
- [4] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 2017.
- [5] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 2018.
- [6] Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
- [7] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.
- [8] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German traffic sign recognition benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, 2011.
- [9] Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, and Dacheng Tao. Parameter efficient multi-task model fusion with partial linearization. In International Conference on Learning Representations, 2024.
- [10] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018.
- [11] A Appendix / Supplementary Material. The appendix is organized as follows:
  • Sec. B discusses the main limitations of our approach, including memory requirements and curvature-estimation challenges.
  • Sec. C provides a derivation and a formal bound on the approximation error introduced when merging multiple KFA...
- [12] should be explored. While we combine the well-established KFAC with an accumulation strategy, designing curvature approximations that can easily be merged without sacrificing accuracy may be worth exploring in the future. Moreover, our experiments in the text domain indicate room for improvement, raising the question of whether more sophisticated techni...
- [13] in ViT-B/16. Non-linear fine-tuning (Ilharco et al., 2022), linear fine-tuning (Ortiz-Jimenez et al., 2023), attention-only fine-tuning (Jin et al., 2025), linear fine-tuning with KFAC regularization. where $\mathrm{dist}(y_1, y_2) = \mathbb{1}(y_1 \neq y_2)$. When $\xi(\alpha_1, \alpha_2) = 0$, tasks $\tau_1$ and $\tau_2$ merge without interference for the corresponding values of $\alpha_1$ and α
- [14] As shown in the plots, linearized fine-tuning substantially improves the disentanglement of task vectors. This property is further enhanced under our regularization regime, where only a few darker regions remain, mostly for α > 1, a setting that is never used in practice. Notably, in our experiments the disentanglement error is consistently close to zero al...
- [15] and conduct experiments on Stanford Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Stallkamp et al., 2011), MNIST (LeCun et al., 2002), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2016), and SVHN (Netzer et al., 2011). For training the task vectors, we followed the setup of previous works Ilharco et ...
- [16] which are three-way classification tasks where the relation between a premise and a hypothesis must be identified as entailment, contradiction, or neutral. Additionally, SciTail (Khot et al., 2018), RTE (Wang et al., 2018), and QNLI (Wang et al.,
- [17] are binary entailment tasks, and therefore fine-tuning
  Table 4: Comparison of different merging strategies in the linear fine-tuning regime, with and without KFAC regularization. Results are reported for α = 1.0 and the best-performing α.
  Columns: Method, α, ViT-B/32 (Abs., Norm.), ViT-B/16 (Abs., Norm.). Linear FT + TIES Yadav e...
- [18] Interestingly, the larger ViT-L/14 backbone exhibits smaller relative gains from regularization, particularly in the non-linear regime, where its behavior closely resembles that of its linearized counterpart. Consistent with prior work (Ortiz-Jimenez et al., 2023), this suggests that very large models may already display an implicit form of regularizatio...
- [19] ViT-B/32
  λ       Seed 7   Seed 21   Seed 42   Avg.
  0       75.0     75.4      75.1      75.2 ± 0.028
  1       82.2     82.4      80.6      81.7 ± 0.648
  10      85.2     85.1      85.1      85.1 ± 0.002
  100     86.2     85.8      86.0      86.0 ± 0.026
  1000    86.5     86.4      86.4      86.4 ± 0.002
  10000   84.5     84.4      84.3      84.4 ± 0.006
  ViT-B/16
  λ       Seed 7   Seed 21   Seed 42   Avg.
  0       79.1     78.7      79.1      79.0 ± 0.188
  1       83.2     83.4      83.8      83.5 ± 0.265
  50      86.9     86.8      87.0      86.9 ± 0.059
  500     88.0     87.9      88.2      88.0±0.11...
- [20] This behavior confirms that successful model merging primarily depends on the presence of regularization based on information from the generalized Gauss-Newton matrix,
  Table 6: Task addition results on the eight vision datasets when using either task-specific KFAC factors or a single shared KFAC computed on ...
- [21] into 5 tasks of 2 classes each. After fine-tuning the base model on each partition, the resulting models are merged and evaluated on the full test set, considering the union of all classes across tasks rather than restricting evaluation to the classes of the training task only, as done in the 8 Vision benchmark. Accuracy is then reported on this joint classif...
- [22] As reported in Tab. 8, our method exhibits consistently strong performance in the text domain, mirroring the trends observed in the vision setting.
  G Related Works on Linearized Fine-Tuning
  Linearized models offer a principled lens for analyzing fine-tuning by considering first-order expansions around a pre-trained initialization. Foundational work (Arora...
- [23] showed that infinitely wide networks trained with gradient descent follow kernel gradient flow under the Neural Tangent Kernel (NTK), yielding exact functional characterizations of training dynamics. This perspective has since been extended to more realistic settings, including representation learning (Mu et al., 2020), small-data regimes (Arora et al.,...