Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic

Angelo Porrello; Francesca Morandi; Simone Calderara; Thomas Sommariva

arxiv: 2605.18993 · v2 · pith:C6A7VSIWnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-25 05:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords task arithmetic · model merging · linear fine-tuning · knowledge distillation · task vectors · model editing · fine-tuning

The pith

Non-linear fine-tuning inherits linearized task-vector properties by distilling hidden representations from a curvature-regularized teacher.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that the disentangled, low-interference task vectors produced by linear fine-tuning can be transferred to ordinary non-linear models. Linear fine-tuning works well for adding and subtracting task vectors but limits what the model can learn and raises inference cost. The authors enforce the desired parameter-space linearity by constraining the student model's activations to match those of a curvature-regularized linearized teacher during standard fine-tuning. If this transfer succeeds, task arithmetic becomes practical on full-capacity models without any extra cost at inference time. A reader would care because model merging and unlearning would no longer require a separate linearized training pipeline.

Core claim

Linearity with respect to weight perturbations can be enforced through activation-space constraints by distilling hidden representations from a curvature-regularized linearized teacher into a non-linear student trained with conventional fine-tuning. The resulting student produces task vectors that remain disentangled and resistant to interference, enabling effective composition on vision and language benchmarks without inference-time overhead.
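For orientation, linear fine-tuning in the tangent space is commonly written as a first-order Taylor expansion around the pre-trained weights. The notation below ($f$, $\theta_0$, $\tau$) is assumed for illustration and is not taken from the paper:

```latex
% Assumed notation: f is the network, \theta_0 the pre-trained weights,
% \tau = \theta - \theta_0 a task vector.
f_{\mathrm{lin}}(x;\theta_0+\tau) \;=\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}\tau

% A non-linear model behaves linearly with respect to weight perturbations when
f(x;\theta_0+\tau) \;\approx\; f_{\mathrm{lin}}(x;\theta_0+\tau)
```

Under this reading, the core claim is that the second relation can be encouraged in a non-linear student by constraining its activations rather than by training $f_{\mathrm{lin}}$ directly.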

What carries the argument

Distillation of hidden representations from a curvature-regularized linearized teacher into a conventionally fine-tuned non-linear student.
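A minimal sketch of what such an objective could look like, assuming a mean-squared-error match averaged over layers. The function names and the MSE form are assumptions made for illustration; only the weight γ = 1.0 appears in the excerpts quoted on this page, and the paper's actual loss is not reproduced here.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized activation vectors."""
    if len(a) != len(b):
        raise ValueError("activation vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_objective(
    task_loss: float,
    student_hidden: List[Sequence[float]],
    teacher_hidden: List[Sequence[float]],
    gamma: float = 1.0,
) -> float:
    """task_loss + gamma * (mean over layers of MSE(student_l, teacher_l)).

    Assumption: one activation vector per layer, matched layer by layer.
    """
    per_layer = [mse(s, t) for s, t in zip(student_hidden, teacher_hidden)]
    return task_loss + gamma * sum(per_layer) / len(per_layer)


# Toy activations: the student differs from the teacher only in the first layer.
student = [[0.0, 1.0], [2.0, 2.0]]
teacher = [[0.0, 3.0], [2.0, 2.0]]
loss = distillation_objective(task_loss=0.5, student_hidden=student, teacher_hidden=teacher)
```

In a real training loop the activations would come from forward passes of the two networks on the same batch. The sketch only shows how the two terms combine.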

If this is right

Task vectors from the non-linear student can be added and subtracted for model merging and unlearning with low interference.
The student achieves strong performance on vision and language task-arithmetic benchmarks comparable to linearized models.
No inference-time overhead is incurred relative to standard non-linear fine-tuning.
The approach works across multiple vision and language benchmarks without changing the inference architecture.
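The addition and subtraction referred to above can be sketched on plain parameter dictionaries. The `Params` format, the helper names, and the toy checkpoints are hypothetical and chosen only for illustration:

```python
from typing import Dict, List

# Hypothetical flat-parameter format: parameter name -> list of floats.
Params = Dict[str, List[float]]


def task_vector(pretrained: Params, finetuned: Params) -> Params:
    """tau = theta_finetuned - theta_pretrained, computed per parameter."""
    return {k: [f - p for f, p in zip(finetuned[k], pretrained[k])] for k in pretrained}


def apply_task_vectors(pretrained: Params, vectors: List[Params], alpha: float) -> Params:
    """theta = theta_pretrained + alpha * sum_t tau_t (negative alpha negates)."""
    edited = {k: list(v) for k, v in pretrained.items()}
    for tau in vectors:
        for k in edited:
            edited[k] = [e + alpha * t for e, t in zip(edited[k], tau[k])]
    return edited


theta0 = {"w": [1.0, 2.0], "b": [0.5]}
theta_a = {"w": [1.5, 2.0], "b": [0.5]}  # toy checkpoint fine-tuned on task A
theta_b = {"w": [1.0, 1.0], "b": [1.5]}  # toy checkpoint fine-tuned on task B

tau_a = task_vector(theta0, theta_a)
tau_b = task_vector(theta0, theta_b)

merged = apply_task_vectors(theta0, [tau_a, tau_b], alpha=1.0)  # task addition
forgot_a = apply_task_vectors(theta0, [tau_a], alpha=-1.0)      # task negation
```

The scaling coefficient `alpha` is the quantity that the sensitivity figures further down sweep over.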

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Activation-space distillation may serve as a general method to induce other desirable parameter-space properties without explicit linearization.
The distillation overhead occurs only at training time, so the technique could be applied once and reused across many downstream edits.
If the transfer proves robust, practitioners could fine-tune once with this procedure and then freely compose task vectors on the resulting model.
The method leaves open whether similar distillation can transfer properties such as calibration or adversarial robustness.

Load-bearing premise

Matching hidden representations from the curvature-regularized linearized teacher during training is enough to transfer the disentanglement and low-interference properties of linearized task vectors to the non-linear student.

What would settle it

Task vectors extracted from the distilled non-linear student exhibit high interference or fail to compose cleanly on held-out benchmarks despite the distillation loss converging.
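One way such a check could be run is sketched below, in the spirit of the disentanglement error of Ortiz-Jimenez et al. (2023) cited in the figure captions. The exact metric used in the paper is not reproduced on this page, so the function names, the prediction-disagreement measure, and the toy model are all assumptions:

```python
from typing import Callable, List, Sequence

Vector = List[float]


def add_scaled(theta: Vector, tau: Vector, alpha: float) -> Vector:
    """Return theta + alpha * tau."""
    return [t + alpha * d for t, d in zip(theta, tau)]


def disagreement(
    predict: Callable[[Vector, Sequence[float]], int],
    theta_single: Vector,
    theta_merged: Vector,
    inputs: List[Sequence[float]],
) -> float:
    """Fraction of inputs on which the single-task and merged models disagree."""
    differing = sum(predict(theta_single, x) != predict(theta_merged, x) for x in inputs)
    return differing / len(inputs)


def interference(predict, theta0, taus, task_inputs, alpha=1.0) -> float:
    """Sum over tasks of the disagreement between theta0 + alpha*tau_t and the merged model."""
    merged = list(theta0)
    for tau in taus:
        merged = add_scaled(merged, tau, alpha)
    return sum(
        disagreement(predict, add_scaled(theta0, tau, alpha), merged, xs)
        for tau, xs in zip(taus, task_inputs)
    )


def toy_predict(theta, x):
    # Hypothetical stand-in model: thresholded dot product.
    return int(sum(t * v for t, v in zip(theta, x)) > 0)


theta0 = [0.0, 0.0]
tau_a, tau_b = [1.0, 0.0], [0.0, 1.0]  # task vectors touching different weights
xs_a = [[1.0, 0.0], [-1.0, 0.0]]       # task A inputs only excite weight 0
xs_b = [[0.0, 1.0], [0.0, -1.0]]       # task B inputs only excite weight 1

clean = interference(toy_predict, theta0, [tau_a, tau_b], [xs_a, xs_b])
conflict = interference(toy_predict, theta0, [tau_a, [-2.0, 0.0]], [xs_a, xs_a])
```

In this toy setup `clean` is 0.0 because the two task vectors act on disjoint inputs, while `conflict` is 1.0 because the second vector overwrites the first on task A's inputs. A distilled student that scores high on a measure like this while its distillation loss has converged would count against the paper's premise.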

Figures

Figures reproduced from arXiv: 2605.18993 by Angelo Porrello, Francesca Morandi, Simone Calderara, Thomas Sommariva.

**Figure 1.** Overview. To improve weight disentanglement, a nonlinear student model is fine-tuned by distilling a linearized teacher. Both models are trained with curvature-aware regularization based on an approximation of the Generalized Gauss-Newton matrix.

**Figure 2.** The heatmaps show the disentanglement error (Ortiz-Jimenez et al., 2023) of a non-linear CLIP ViT-B/32 (left) and the non-linear distilled student (right) on several task pairs. The light regions denote areas of the weight space where weight disentanglement is stronger.

**Figure 4.** Histogram of the linearization error on 8-Vision.

**Figure 3.** Sensitivity to the scaling coefficient in LoRA merging. Additional results are provided in …

**Figure 5.** Support Localization. Histograms of the activation MSE between the pre-trained and individually fine-tuned models. In our method, the in-domain and out-of-domain distributions are markedly more separated, indicating stronger weight disentanglement.

**Figure 6.** Impact on task arithmetic. Per-task accuracy of the merged model under different configurations. Similar trends persist for ViT-L/14 and T5-Base, as shown in …

**Figure 9.** Teacher-Student comparison. Left: Accuracy of the merged models. Right: Accuracy of the single task fine-tunings.
[Plot residue from an adjacent figure: panels "Reward Pareto Front" and "Accuracy Pareto Front"; axes Verbosity (%) and Helpfulness (%); marker "Best H−V"; legend: DPO-Mixed, Non-Linear DPO, Linear DPO, Distilled DPO.]

**Figure 8.** Effect of APKD on robustness to α-sweep.

**Figure 11.** Weight Disentanglement (Ortiz-Jimenez et al., 2023) for Non-Linear FT, Linear FT, Attention-Only FT (Jin et al., 2025) and our distillation-based approach.
[Plot residue from an adjacent figure: panels ViT-B/32 8-Vision, ViT-L/14 8-Vision, ViT-B/32 14-Vision; y-axis Accuracy (%); legend: Non-Linear FT, Linear FT, ISO, TSV, CORE + ISO, …]

**Figure 12.** Sensitivity of the merged model to the scaling coefficient on LoRA checkpoints.

**Figure 13.** Individual task performance. Accuracy of the merged model on each task, illustrating the impact of our distillation framework across different architectures and benchmarks.

**Figure 14.** Student-Teacher comparison. Accuracy of the single task fine-tuning on the corresponding dataset.

**Figure 15.** Histogram of the linearization error with the ViT-L/14 backbone.

**Figure 16.** Activation MSE Distributions. Extended histograms of the activation MSE between the pre-trained and fine-tuned models, as defined in …

**Figure 17.** Activation MSE Distributions on ViT-L/14. Histograms of the activation MSE evaluated on the ViT-L/14 backbone. Consistent with the base architecture, the ablation of our loss components confirms that curvature regularization remains the primary driver of support localization at scale.

Original abstract

Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unlearning through subtraction. Fine-tuning in the tangent space of a pre-trained model (linear fine-tuning) has proven effective, as it produces task vectors that are naturally disentangled and resistant to interference. However, linearized models suffer from limited expressivity during training and incur higher computational costs at inference time, which restrict their practical applicability. In this work, we bridge the gap between linear and standard non-linear fine-tuning. We show that linearity with respect to weight perturbations, a property defined in parameter space, can be enforced through constraints in activation space during training. Concretely, we distill hidden representations from a curvature-regularized linearized teacher into a non-linear student trained via conventional fine-tuning. We find that the resulting model inherits key properties of linearized models for task arithmetic, enabling effective composition of task vectors and achieving strong performance across vision and language benchmarks without incurring any inference-time overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The distillation trick to import linearized task-vector properties into ordinary fine-tuning is a sensible idea, but the abstract supplies zero numbers or controls so the central claim cannot be checked.

The paper's core move is to distill hidden activations from a curvature-regularized linearized teacher into a standard non-linear student so that the student ends up with task vectors that add and subtract cleanly. That specific transfer (parameter-space linearity enforced via activation-space constraints) is the new piece; it sits on top of existing linear fine-tuning and distillation work but applies the combination to task arithmetic in a way that has not been tried before. The motivation is clear and practical: linearized models give good disentanglement for merging and unlearning but lose expressivity and add inference cost, so removing those drawbacks while keeping the arithmetic benefits would be useful. The write-up states the problem cleanly and describes a concrete procedure that avoids inference overhead.

The main weakness is the complete absence of evidence. The abstract claims strong results on vision and language benchmarks yet reports no accuracies, no baselines, no ablations on the distillation loss, and no direct measurements of interference under composed task vectors. The weakest link is the unexamined assumption that single-task activation matching will prevent non-linear interactions when multiple task vectors are added at test time; nothing in the description shows why that proxy should hold or provides a bound.

If the full paper contains controlled experiments that close this gap, the work becomes worth taking seriously. For readers working on model editing and merging, the idea is worth following if the numbers appear. It deserves peer review once the experimental section is present and the controls are shown; without them the manuscript is too thin to evaluate.

Referee Report

2 major / 1 minor

Summary. The paper claims that linearity w.r.t. weight perturbations (a parameter-space property) can be enforced via activation-space constraints by distilling hidden representations from a curvature-regularized linearized teacher into a non-linear student trained with standard fine-tuning; the resulting model is asserted to inherit disentangled, low-interference task vectors, enabling effective task arithmetic on vision and language benchmarks with no inference overhead.

Significance. If the transfer of properties holds, the method would make task-vector composition practical in standard non-linear models, removing the expressivity and inference-cost limitations of linearized fine-tuning while retaining its advantages for model merging and editing.

major comments (2)

[Abstract] Abstract: the claim of 'strong performance across vision and language benchmarks' is unsupported by any quantitative results, baselines, controls, or experimental details, preventing verification of the central claim.
[Method description] The distillation procedure: matching hidden representations from the curvature-regularized linearized teacher on single-task inputs is presented as sufficient to transfer disentanglement and low-interference properties, yet no explicit mechanism, bound, or test is given showing why this activation-space proxy constrains non-linear interactions under additive composition of multiple task vectors in parameter space.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy or interference metric) to ground the performance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, proposing revisions to improve clarity and verifiability where the concerns are valid.

Referee: [Abstract] Abstract: the claim of 'strong performance across vision and language benchmarks' is unsupported by any quantitative results, baselines, controls, or experimental details, preventing verification of the central claim.

Authors: We agree that the abstract would be strengthened by greater specificity to allow immediate verification. The full manuscript reports quantitative results with baselines and controls in Sections 4.1–4.3 (vision: the 8-Vision benchmark and its extension with six additional classification datasets; language: the 6-NLI benchmark), but these details are not summarized numerically in the abstract. We will revise the abstract to include key metrics (e.g., relative performance on multi-task merging and unlearning) and reference the experimental setup.

Revision: yes

Referee: [Method description] The distillation procedure: matching hidden representations from the curvature-regularized linearized teacher on single-task inputs is presented as sufficient to transfer disentanglement and low-interference properties, yet no explicit mechanism, bound, or test is given showing why this activation-space proxy constrains non-linear interactions under additive composition of multiple task vectors in parameter space.

Authors: This correctly identifies a gap in the current exposition. The manuscript motivates the approach via the equivalence of curvature regularization to linearized behavior and provides empirical evidence that the resulting student models support effective task arithmetic (reduced interference on composed vectors). However, it does not supply a formal bound or explicit test isolating why single-task activation matching suffices to constrain non-linear cross terms under parameter-space addition. We will add a dedicated discussion subsection with additional controls measuring effective linearity on multi-vector compositions.

Revision: partial
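The cross terms at issue can be made explicit with a second-order Taylor expansion. This is a standard illustration under assumed notation (scalar output $f$, Hessian $H$ taken at $\theta_0$, task vectors $\tau_1$, $\tau_2$), not a derivation from the paper:

```latex
% Second-order expansion of the merged model around the pre-trained weights.
f(x;\theta_0+\tau_1+\tau_2)
  = f(x;\theta_0)
  + \nabla_\theta f(x;\theta_0)^{\top}(\tau_1+\tau_2)
  + \tfrac{1}{2}\,\tau_1^{\top} H \tau_1
  + \tfrac{1}{2}\,\tau_2^{\top} H \tau_2
  + \tau_1^{\top} H \tau_2
  + O\!\left(\|\tau_1+\tau_2\|^{3}\right)

% Here H = \nabla_\theta^2 f(x;\theta_0) is symmetric, which is why the two
% mixed terms collapse into the single cross term \tau_1^{\top} H \tau_2.
```

Single-task training only ever evaluates the model at $\theta_0+\tau_1$ or $\theta_0+\tau_2$, so the term $\tau_1^{\top} H \tau_2$ is never observed directly. That is the quantity a control on multi-vector compositions would need to probe.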

Circularity Check

0 steps flagged

No circularity: empirical distillation method with independent empirical validation

The paper presents a training procedure (distilling activations from a curvature-regularized linearized teacher into a non-linear student) whose claimed benefits are evaluated on external vision and language benchmarks. No equations, fitted parameters, or self-citations are shown to reduce the central claim to a tautology or to quantities defined by the method itself. The derivation chain consists of a proposed proxy (activation matching) whose sufficiency is treated as an empirical question rather than proven by construction. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, invented entities, or non-standard axioms; the method implicitly relies on standard deep-learning training assumptions.

axioms (1)

domain assumption: Standard assumptions of deep learning optimization and distillation hold for the teacher-student setup.
The approach assumes typical convergence and representation-matching behavior in fine-tuning.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

[1] Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
[2] Buzzega, P., Salami, R., Porrello, A., and Calderara, S. Rethinking layer-wise model merging through chain of merges. arXiv preprint arXiv:2508.21421, 2025.
[3] Erdogan, M. Tangent space fine-tuning for directional preference alignment in large language models. arXiv preprint arXiv:2602.01128, 2026.
[4] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[5] Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., and Zamparelli, R. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
[6] Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. The German traffic sign recognition benchmark: a multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, 2011.
[7] Distilling Linearized Behavior for Effective Task Arithmetic, Appendix. The appendix is organized as follows: • Sec. A provides an analysis of the computational footprint, including memory requirements, training overhead, and inference costs. • Sec. B details the implementation of our methods, with separate discussions for the vision and text domains. • ...
[8] The code for replicating our method is available at https://github.com/apanariello4/merge-and-rebase (Panariello et al., 2026). Vision domain. We evaluate our framework on the 8-Vision benchmark (Ilharco et al., 2022a), which comprises eight heterogeneous image classification datasets: Stanford Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT ...
[9] ... serves as an extension for scalability analysis, incorporating six additional classification datasets: CIFAR100 (Krizhevsky et al., 2009), STL10 (Coates et al., 2011), Flowers102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pet (Parkhi et al., 2012), PCAM (Veeling et al., 2018), and FER2013 (Goodfellow et al., 2013). For training vision task vectors, we followed the setup of previous works (Ilharco et al., 2022a; Ortiz-Jimenez et al., 2023; Yoshida et al., 2025), adopting a batch size of ...
[10] The distillation loss term in Eq. (7) is weighted by γ = 1.0. We grid the coefficient α in (0, 1] for task addition and in (0, 2] for task negation. Language domain. Natural Language Inference (NLI) experiments are tested on the 6-NLI benchmark (Stoica et al., 2025), which includes six datasets: SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), ...
[11] ... and QQP (DataCanary et al., 2017), for language tasks. As shown, while the choice of reference dataset has a measurable impact, the overall performance remains highly robust. In all cases, the student model maintains state-of-the-art results, outperforming all competing methods from Tab. 2 across nearly all evaluation settings. D. Additional experiments. ...
[12] ... and our distillation-based approach. [Plot residue: panels ViT-B/32 8-Vision, ViT-L/14 8-Vision, ViT-B/32 14-Vision; x-axis α; y-axis Accuracy (%); legend: Non-Linear FT, Linear FT, ISO, TSV, CORE + ISO, CORE + TSV, DELTA.] Figure 12. Sensitivity of the me...
