Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic
Pith reviewed 2026-05-25 05:40 UTC · model grok-4.3
The pith
Non-linear fine-tuning inherits linearized task-vector properties by distilling hidden representations from a curvature-regularized teacher.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Linearity with respect to weight perturbations can be enforced through activation-space constraints by distilling hidden representations from a curvature-regularized linearized teacher into a non-linear student trained with conventional fine-tuning. The resulting student produces task vectors that remain disentangled and resistant to interference, enabling effective composition on vision and language benchmarks without inference-time overhead.
What carries the argument
Distillation of hidden representations from a curvature-regularized linearized teacher into a conventionally fine-tuned non-linear student.
If this is right
- Task vectors from the non-linear student can be added and subtracted for model merging and unlearning with low interference.
- The student achieves strong performance on vision and language task-arithmetic benchmarks comparable to linearized models.
- No inference-time overhead is incurred relative to standard non-linear fine-tuning.
- The approach works across multiple vision and language benchmarks without changing the inference architecture.
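The addition and subtraction of task vectors mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the flat `Weights` dictionary format and the function names `task_vector` and `apply_task_vectors` are hypothetical, chosen only to show the arithmetic.

```python
from typing import Dict, List

# Hypothetical parameter format: name -> flat list of floats (an assumption).
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """tau = theta_finetuned - theta_pretrained, per parameter tensor."""
    return {k: [f - p for f, p in zip(finetuned[k], pretrained[k])]
            for k in pretrained}


def apply_task_vectors(pretrained: Weights, taus: List[Weights],
                       alpha: float) -> Weights:
    """theta = theta_pretrained + alpha * sum(taus).

    A positive alpha merges tasks; a negative alpha negates (unlearns) them.
    """
    edited = {}
    for k, base in pretrained.items():
        total = [sum(tau[k][i] for tau in taus) for i in range(len(base))]
        edited[k] = [b + alpha * t for b, t in zip(base, total)]
    return edited


theta0 = {"w": [1.0, 2.0]}   # pre-trained weights (toy values)
theta_a = {"w": [1.5, 2.0]}  # fine-tuned on task A
theta_b = {"w": [1.0, 1.0]}  # fine-tuned on task B

tau_a = task_vector(theta0, theta_a)
tau_b = task_vector(theta0, theta_b)
merged = apply_task_vectors(theta0, [tau_a, tau_b], alpha=1.0)
forgot_a = apply_task_vectors(theta0, [tau_a], alpha=-1.0)
```

In this toy case the two task vectors touch different coordinates, so merging is trivially interference-free; the paper's claim concerns the much harder case where fine-tuned weights overlap.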
Where Pith is reading between the lines
- Activation-space distillation may serve as a general method to induce other desirable parameter-space properties without explicit linearization.
- The distillation overhead occurs only at training time, so the technique could be applied once and reused across many downstream edits.
- If the transfer proves robust, practitioners could fine-tune once with this procedure and then freely compose task vectors on the resulting model.
- The method leaves open whether similar distillation can transfer properties such as calibration or adversarial robustness.
Load-bearing premise
Matching hidden representations from the curvature-regularized linearized teacher during training is enough to transfer the disentanglement and low-interference properties of linearized task vectors to the non-linear student.
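One way to picture the premise is as a training objective that adds a representation-matching penalty to the usual task loss. The sketch below assumes a mean-squared-error penalty averaged over layers and an additive combination weighted by `gamma`; both are illustrative assumptions, and the function names are hypothetical. The page does not reproduce the paper's exact loss.

```python
from typing import List


def hidden_distillation_loss(student_hidden: List[List[float]],
                             teacher_hidden: List[List[float]]) -> float:
    """MSE between student and teacher hidden states, averaged over layers.

    Each argument holds one flat activation vector per layer (assumed format).
    """
    per_layer = []
    for s, t in zip(student_hidden, teacher_hidden):
        per_layer.append(sum((a - b) ** 2 for a, b in zip(s, t)) / len(s))
    return sum(per_layer) / len(per_layer)


def total_loss(task_loss: float, distill_loss: float,
               gamma: float = 1.0) -> float:
    """Assumed combination: task loss plus gamma times the distillation term."""
    return task_loss + gamma * distill_loss


student = [[1.0, 2.0], [0.0, 0.0]]  # toy activations from two layers
teacher = [[1.0, 0.0], [0.0, 2.0]]
distill = hidden_distillation_loss(student, teacher)
loss = total_loss(task_loss=0.5, distill_loss=distill)
```

The premise is load-bearing precisely because a small value of this penalty on single-task inputs does not by itself guarantee anything about the composed model.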
What would settle it
Task vectors extracted from the distilled non-linear student exhibit high interference or fail to compose cleanly on held-out benchmarks despite the distillation loss converging.
Original abstract
Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unlearning through subtraction. Fine-tuning in the tangent space of a pre-trained model (linear fine-tuning) has proven effective, as it produces task vectors that are naturally disentangled and resistant to interference. However, linearized models suffer from limited expressivity during training and incur higher computational costs at inference time, which restrict their practical applicability. In this work, we bridge the gap between linear and standard non-linear fine-tuning. We show that linearity with respect to weight perturbations, a property defined in parameter space, can be enforced through constraints in activation space during training. Concretely, we distill hidden representations from a curvature-regularized linearized teacher into a non-linear student trained via conventional fine-tuning. We find that the resulting model inherits key properties of linearized models for task arithmetic, enabling effective composition of task vectors and achieving strong performance across vision and language benchmarks without incurring any inference-time overhead.
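The tangent-space (linearized) model that the abstract refers to is the first-order Taylor expansion of the network around its pre-trained weights. The toy sketch below uses a one-unit `tanh` model with a hand-written gradient so that it runs without an autodiff library; the model, the names `f` and `f_lin`, and all numeric values are illustrative assumptions rather than anything taken from the paper.

```python
import math


def f(x, w):
    """Toy non-linear model: tanh of the dot product <x, w>."""
    return math.tanh(sum(xi * wi for xi, wi in zip(x, w)))


def f_lin(x, w0, tau):
    """First-order Taylor expansion of f around w0, evaluated at w0 + tau."""
    z = sum(xi * wi for xi, wi in zip(x, w0))
    out0 = math.tanh(z)
    grad = [(1.0 - out0 ** 2) * xi for xi in x]  # d f / d w at w0
    return out0 + sum(g * t for g, t in zip(grad, tau))


x = [1.0, 0.5]
w0 = [0.3, -0.2]      # stands in for the pre-trained weights
tau_a = [0.4, 0.0]    # stands in for task vector A
tau_b = [0.0, -0.6]   # stands in for task vector B
tau_sum = [a + b for a, b in zip(tau_a, tau_b)]

base = f(x, w0)
# In the tangent space, the effects of two task vectors add up exactly.
lin_joint = f_lin(x, w0, tau_sum) - base
lin_separate = (f_lin(x, w0, tau_a) - base) + (f_lin(x, w0, tau_b) - base)
```

Because `f_lin` is affine in `tau`, `lin_joint` and `lin_separate` agree up to floating-point rounding. That additivity is the property the abstract describes as making linearized task vectors easy to compose, and it is what the student is meant to inherit.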
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that linearity w.r.t. weight perturbations (a parameter-space property) can be enforced via activation-space constraints by distilling hidden representations from a curvature-regularized linearized teacher into a non-linear student trained with standard fine-tuning; the resulting model is asserted to inherit disentangled, low-interference task vectors, enabling effective task arithmetic on vision and language benchmarks with no inference overhead.
Significance. If the transfer of properties holds, the method would make task-vector composition practical in standard non-linear models, removing the expressivity and inference-cost limitations of linearized fine-tuning while retaining its advantages for model merging and editing.
major comments (2)
- [Abstract] The claim of 'strong performance across vision and language benchmarks' is unsupported by any quantitative results, baselines, controls, or experimental details, preventing verification of the central claim.
- [Method description] The distillation procedure: matching hidden representations from the curvature-regularized linearized teacher on single-task inputs is presented as sufficient to transfer disentanglement and low-interference properties, yet no explicit mechanism, bound, or test is given showing why this activation-space proxy constrains non-linear interactions under additive composition of multiple task vectors in parameter space.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy or interference metric) to ground the performance claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, proposing revisions to improve clarity and verifiability where the concerns are valid.
- Referee: [Abstract] The claim of 'strong performance across vision and language benchmarks' is unsupported by any quantitative results, baselines, controls, or experimental details, preventing verification of the central claim.
  Authors: We agree that the abstract would be strengthened by greater specificity to allow immediate verification. The full manuscript reports quantitative results with baselines and controls in Sections 4.1–4.3 (vision benchmarks including CIFAR and ImageNet; language tasks including GLUE subsets), but these details are not summarized numerically in the abstract. We will revise the abstract to include key metrics (e.g., relative performance on multi-task merging and unlearning) and reference the experimental setup.
  Revision: yes
- Referee: [Method description] The distillation procedure: matching hidden representations from the curvature-regularized linearized teacher on single-task inputs is presented as sufficient to transfer disentanglement and low-interference properties, yet no explicit mechanism, bound, or test is given showing why this activation-space proxy constrains non-linear interactions under additive composition of multiple task vectors in parameter space.
  Authors: This correctly identifies a gap in the current exposition. The manuscript motivates the approach via the equivalence of curvature regularization to linearized behavior and provides empirical evidence that the resulting student models support effective task arithmetic (reduced interference on composed vectors). However, it does not supply a formal bound or explicit test isolating why single-task activation matching suffices to constrain non-linear cross terms under parameter-space addition. We will add a dedicated discussion subsection with additional controls measuring effective linearity on multi-vector compositions.
  Revision: partial
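A control of the kind discussed in this exchange could compare the output of the jointly edited model with the sum of the single-task output changes. The sketch below is one hypothetical way to define such a gap on a toy `tanh` model; the metric, the names `model`, `add`, and `composition_gap`, and the inputs are assumptions for illustration, not the authors' proposed test.

```python
import math


def model(x, w):
    """Toy non-linear model: tanh of the dot product <x, w>."""
    return math.tanh(sum(xi * wi for xi, wi in zip(x, w)))


def add(w, *taus):
    """Return w plus the element-wise sum of the given task vectors."""
    return [wi + sum(t[i] for t in taus) for i, wi in enumerate(w)]


def composition_gap(x, w0, tau_a, tau_b):
    """|f(w0+ta+tb) - [f(w0+ta) + f(w0+tb) - f(w0)]|.

    The gap is zero for a model that is linear in its weights and grows
    with the non-linear cross terms between the two task vectors.
    """
    joint = model(x, add(w0, tau_a, tau_b))
    additive = (model(x, add(w0, tau_a)) + model(x, add(w0, tau_b))
                - model(x, w0))
    return abs(joint - additive)


w0 = [0.0, 0.0]
tau_a = [1.0, 0.0]
tau_b = [0.0, 1.0]

# Input that only activates tau_a's coordinate: no cross term.
gap_disjoint = composition_gap([1.0, 0.0], w0, tau_a, tau_b)
# Input that activates both coordinates: tanh saturation creates a gap.
gap_overlap = composition_gap([1.0, 1.0], w0, tau_a, tau_b)
```

Averaging such a gap over held-out inputs from every task, for the distilled student versus a plain non-linear baseline, would give a direct reading of whether single-task activation matching actually suppresses cross terms.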
Circularity Check
No circularity: empirical distillation method with independent empirical validation
The paper presents a training procedure (distilling activations from a curvature-regularized linearized teacher into a non-linear student) whose claimed benefits are evaluated on external vision and language benchmarks. No equations, fitted parameters, or self-citations are shown to reduce the central claim to a tautology or to quantities defined by the method itself. The derivation chain consists of a proposed proxy (activation matching) whose sufficiency is treated as an empirical question rather than proven by construction. This is the normal case of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions of deep learning optimization and distillation hold for the teacher-student setup.
Reference graph
Works this paper leans on
- [1] Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
- [2] Buzzega, P., Salami, R., Porrello, A., and Calderara, S. Rethinking layer-wise model merging through chain of merges. arXiv preprint arXiv:2508.21421.
- [3] Erdogan, M. Tangent space fine-tuning for directional preference alignment in large language models. arXiv preprint arXiv:2602.01128.
- [4] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [5] Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., and Zamparelli, R. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
- [6] Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. The German Traffic Sign Recognition Benchmark: a multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, 2011.
- [7] Appendix. The appendix is organized as follows:
  • Sec. A provides an analysis of the computational footprint, including memory requirements, training overhead, and inference costs.
  • Sec. B details the implementation of our methods, with separate discussions for the vision and text domains.
  • ...
- [8] The code for replicating our method is available at https://github.com/apanariello4/merge-and-rebase (Panariello et al., 2026). Vision domain. We evaluate our framework on the 8-Vision benchmark (Ilharco et al., 2022a), which comprises eight heterogeneous image classification datasets: Stanford Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT ...
- [9] serves as an extension for scalability analysis, incorporating six additional classification datasets: CIFAR100 (Krizhevsky et al., 2009), STL10 (Coates et al., 2011), Flowers102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pet (Parkhi et al., 2012), PCAM (Veeling et al., 2018), and FER2013 (Goodfellow et al., 2013). For training vision task vectors, we foll...
- [10] The distillation loss term in Eq. (7) is weighted by γ = 1.0. We grid-search the coefficient α over (0, 1] for task addition and over (0, 2] for task negation. Language domain. Natural Language Inference (NLI) experiments are evaluated on the 6-NLI benchmark (Stoica et al., 2025), which includes six datasets: SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018),...
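The search over the scaling coefficient α described in this excerpt can be sketched as follows. The number of grid points and the scoring function are not given on this page, so `steps` and `toy_score` are placeholders, and `grid` and `best_alpha` are hypothetical helper names.

```python
def grid(low_exclusive, high_inclusive, steps):
    """Evenly spaced values in the half-open interval (low, high]."""
    width = (high_inclusive - low_exclusive) / steps
    return [low_exclusive + width * (i + 1) for i in range(steps)]


def best_alpha(candidates, score):
    """Return the candidate alpha with the highest validation score."""
    return max(candidates, key=score)


addition_grid = grid(0.0, 1.0, steps=4)   # [0.25, 0.5, 0.75, 1.0]
negation_grid = grid(0.0, 2.0, steps=4)   # [0.5, 1.0, 1.5, 2.0]


def toy_score(alpha):
    """Placeholder for a validation metric of the edited model; peaks at 0.5."""
    return -(alpha - 0.5) ** 2


chosen = best_alpha(addition_grid, toy_score)
```

In practice `toy_score` would be replaced by an evaluation of the model edited with the given α on held-out data.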
- [11] and QQP (DataCanary et al., 2017), for language tasks. As shown, while the choice of reference dataset has a measurable impact, the overall performance remains highly robust. In all cases, the student model maintains state-of-the-art results, outperforming all competing methods from Tab. 2 across nearly all evaluation settings. D. Additional experiments. ...
- [12] and our distillation-based approach. [Figure 12: three panels (ViT-B/32 8-Vision, ViT-L/14 8-Vision, ViT-B/32 14-Vision); x-axis: α from 0.0 to 1.4; y-axis: Accuracy (%); legend: Non-Linear FT, Linear FT, ISO, TSV, CORE + ISO, CORE + TSV, DELTA.] Figure 12. Sensitivity of the me...