arxiv: 2606.12498 · v1 · pith:K25EJZHNnew · submitted 2026-06-10 · 💻 cs.CR · cs.LG

From Parameters to Feature Space: Task Arithmetic for Backdoor Mitigation in Model Merging

Pith reviewed 2026-06-27 09:22 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords model merging · backdoor attacks · task arithmetic · feature space · cross-task linearity · anti-backdoor task vector · backdoor mitigation

The pith

By shifting task arithmetic to feature space, Linear Feature Path Minimization suppresses backdoors in merged models while preserving clean performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that backdoor mitigation in model merging can be achieved by optimizing an anti-backdoor task vector in feature space rather than parameter space. It uses the Cross-Task Linearity framework, which assumes approximate linearity of features across tasks, to guide this optimization. This approach aims to eliminate backdoors introduced during merging without the performance drops seen in prior parameter-editing methods. The framework includes a specific optimization using gradient accumulation and loss path-integral to ensure suppression along the model interpolation path. If successful, it provides a more robust way to combine task-specific models securely.

Core claim

LFPM formulates the backdoor robustness of the merged model from a unified feature-space perspective under the Cross-Task Linearity (CTL) framework, which leverages the approximate linearity of features across tasks. This perspective guides the optimization of the anti-backdoor task to suppress backdoors while preserving clean-task performance, with an effective optimization mechanism based on gradient accumulation and loss path-integral. Extensive experiments demonstrate that LFPM consistently exhibits strong robustness against backdoor attacks in both full fine-tuning and Parameter-Efficient Fine-Tuning settings.

What carries the argument

Linear Feature Path Minimization (LFPM) under the Cross-Task Linearity (CTL) framework, which optimizes an anti-backdoor task vector in feature space to suppress backdoors along the interpolation path.
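
A minimal sketch of the parameter-space bookkeeping that surrounds this idea, assuming the usual task-arithmetic convention: a task vector is the difference between fine-tuned and pre-trained weights, and merging adds scaled task vectors to the pre-trained weights. The function names, the coefficients `lam` and `beta`, and the toy numbers are illustrative assumptions, not values from the paper. The feature-space optimization that LFPM uses to find the anti-backdoor vector is not shown; here that vector is simply written down by hand.

```python
from typing import List

Vector = List[float]

def task_vector(finetuned: Vector, pretrained: Vector) -> Vector:
    """Task vector: fine-tuned weights minus pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def merge(pretrained: Vector, task_vectors: List[Vector], lam: float) -> Vector:
    """Task-arithmetic merge: pre-trained weights plus lam times each task vector."""
    merged = list(pretrained)
    for tv in task_vectors:
        merged = [m + lam * t for m, t in zip(merged, tv)]
    return merged

def add_anti_backdoor(merged: Vector, anti_vec: Vector, beta: float = 1.0) -> Vector:
    """Add a (hypothetical, hand-picked) anti-backdoor task vector to the merged model."""
    return [m + beta * a for m, a in zip(merged, anti_vec)]

theta_pre = [0.0, 0.0, 0.0]
theta_clean = [1.0, 0.0, 0.0]
# Toy assumption: the last coordinate stands in for the backdoor direction.
theta_backdoored = [0.0, 1.0, 2.0]

vectors = [task_vector(theta_clean, theta_pre),
           task_vector(theta_backdoored, theta_pre)]
theta_m = merge(theta_pre, vectors, lam=0.5)
anti = [0.0, 0.0, -1.0]  # hand-picked here; LFPM would optimize this vector
theta_def = add_anti_backdoor(theta_m, anti)
```

In this toy setting the anti-backdoor vector cancels only the coordinate standing in for the backdoor, leaving the other two coordinates of the merged model unchanged.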

If this is right

  • Backdoor attacks on merged models can be mitigated effectively in both full fine-tuning and Parameter-Efficient Fine-Tuning settings.
  • The optimization ensures robust backdoor suppression along the interpolation path between models.
  • Clean-task performance is preserved better than with direct parameter-space editing methods.
  • Task arithmetic can be extended from parameters to features for security purposes.
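
The second bullet, suppression along the interpolation path, can be pictured as minimizing a loss averaged over points on that path, with gradients accumulated across the sampled points before each update. The sketch below is a toy illustration under stated assumptions: a quadratic loss stands in for the paper's backdoor-suppression loss, the path is `theta_m + alpha * delta` for `alpha` in [0, 1], and the integral is approximated with a midpoint rule. The names `path_integral_step`, `num_points`, and `lr` are hypothetical.

```python
def loss(theta, target):
    """Toy stand-in loss: half the squared distance to a target point."""
    return 0.5 * sum((t - s) ** 2 for t, s in zip(theta, target))

def grad(theta, target):
    """Gradient of the toy loss with respect to theta."""
    return [t - s for t, s in zip(theta, target)]

def path_integral_step(theta_m, delta, target, num_points=5, lr=0.5):
    """One update of delta using gradients accumulated along theta_m + alpha * delta."""
    alphas = [(i + 0.5) / num_points for i in range(num_points)]  # midpoint rule on [0, 1]
    accumulated = [0.0] * len(delta)
    path_loss = 0.0
    for alpha in alphas:
        theta = [m + alpha * d for m, d in zip(theta_m, delta)]
        path_loss += loss(theta, target) / num_points
        g = grad(theta, target)
        # Chain rule: d(theta)/d(delta) = alpha, so each sample contributes alpha * g.
        accumulated = [a + alpha * gi / num_points for a, gi in zip(accumulated, g)]
    new_delta = [d - lr * a for d, a in zip(delta, accumulated)]
    return new_delta, path_loss

theta_m = [1.0, -1.0]
target = [0.0, 0.0]
delta = [0.0, 0.0]
losses = []
for _ in range(50):
    delta, current = path_integral_step(theta_m, delta, target)
    losses.append(current)
```

Because the objective averages over the whole path rather than only its endpoint, the learned `delta` trades off loss at every interpolation coefficient, which is the property the bullet describes.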

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the feature linearity assumption holds broadly, similar feature-space methods could address other vulnerabilities in model merging such as data poisoning.
  • This suggests that backdoor defenses might benefit from operating in the space where the model's decision boundaries are more linearly separable across tasks.
  • Future work could test whether LFPM scales to merging more than two or three models without additional adjustments.

Load-bearing premise

The assumption that features remain approximately linear across tasks holds sufficiently well that feature-space optimization will reliably suppress backdoors.
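
One way to probe this premise is to compare the features of a weight-interpolated model with the interpolation of the two endpoint models' features. The sketch below assumes the linearity condition takes that form and uses a one-layer tanh network as a toy feature extractor. The weights, the inputs, the "triggered" perturbation, and the name `ctl_gap` are all illustrative assumptions, and the sketch makes no claim about which gap is larger in real models.

```python
import math

def features(weights, x):
    """Toy feature extractor: a single tanh layer; weights is a list of rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def interpolate(w_a, w_b, alpha):
    """Weight-space interpolation alpha * w_a + (1 - alpha) * w_b."""
    return [[alpha * a + (1 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(w_a, w_b)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ctl_gap(w_a, w_b, x, alpha):
    """1 - cosine similarity between features of interpolated weights and interpolated features."""
    f_of_interp = features(interpolate(w_a, w_b, alpha), x)
    f_a, f_b = features(w_a, x), features(w_b, x)
    interp_of_f = [alpha * a + (1 - alpha) * b for a, b in zip(f_a, f_b)]
    return 1.0 - cosine(f_of_interp, interp_of_f)

w_a = [[0.2, -0.1], [0.05, 0.3]]
w_b = [[0.25, -0.05], [0.0, 0.35]]
x_clean = [1.0, 0.5]
x_triggered = [4.0, 0.5]  # toy "trigger": a large shift on the first input coordinate

gap_clean = ctl_gap(w_a, w_b, x_clean, 0.5)
gap_triggered = ctl_gap(w_a, w_b, x_triggered, 0.5)
```

Reporting such a gap separately for clean and triggered inputs is the kind of comparison that would show whether the premise holds where it matters.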

What would settle it

An experiment on a merged model where cross-task feature linearity is deliberately broken, after which LFPM no longer reduces backdoor success rate without also lowering clean-task accuracy.

Figures

Figures reproduced from arXiv: 2606.12498 by Haodong Li, Weixiang Li, Wenjian Luo, Yamin Hu, Yiya Diao, Zhenqian Zhu.

Figure 1: The core idea of LFPM. $v_c$ and $v_b$ denote the directions of the clean-task and backdoor-task vectors, respectively, with $-v_b$ indicating the reversed backdoor direction. $\theta_m$ and $\theta_{k+1}$ denote the merged model and the anti-backdoor model, while $z_m$ and $z_{k+1}$ are their respective feature representations. Under the CTL framework, robustness along the parameter interpolation path from $\theta_m$ to $\theta_{k+1}$ can be achieved by per…
Figure 2: Robustness along the Interpolation Path with Cars196 as the Target Dataset.
5.3. Robustness against Potential Adaptive Attack. We further evaluate the robustness of LFPM under potential adaptive adversaries. Recall that LFPM is built upon two key principles: (i) subspace partitioning, which disentangles adversarial and clean feature components, and (ii) anti-backdoor task vector optimization, which performs …
Figure 3: Robustness of the Merged Model along the Interpolation Path (Task Order 2). Panels: (a) BadMerging, (b) LoBAM. Horizontal axis: Interpolation Coefficient (0.0 to 1.0); vertical axis: ASR (%) (0 to 100). Methods compared: LFPM, SAM, PAM, IBVS, SAU.
Figure 4: Robustness of the Merged Model along the Interpolation Path (Task Order 3)
Figure 5: Validation of the Cross-Task Linearity Condition.
We evaluate the CTL condition for both the initial anti-backdoor model $\theta_{k+1}$ (obtained at Step 11 of Algorithm 3) and the anti-backdoor model after LFPM optimization. As shown in
Original abstract

Model merging (MM) has gained significant attention as a cost-effective approach to integrate multiple task-specific models into a unified model. However, recent work reveals that MM is highly susceptible to backdoor attacks. Existing defenses based on task arithmetic often fail to eliminate backdoors without substantially degrading clean-task performance, owing to their reliance on direct parameter-space editing. To address this gap, we propose Linear Feature Path Minimization (LFPM), a backdoor mitigation framework for model merging, which introduces an anti-backdoor task vector into the backdoored merged model. Unlike prior approaches, LFPM formulates the backdoor robustness of the merged model from a unified feature-space perspective under the Cross-Task Linearity (CTL) framework, which leverages the approximate linearity of features across tasks. This perspective guides the optimization of the anti-backdoor task to suppress backdoors while preserving clean-task performance. Furthermore, we introduce an effective optimization mechanism based on gradient accumulation and loss path-integral, ensuring robust backdoor suppression along the interpolation path. Extensive experiments demonstrate that LFPM consistently exhibits strong robustness against backdoor attacks in both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Linear Feature Path Minimization (LFPM), a framework for mitigating backdoors in model merging. It shifts from direct parameter-space task arithmetic to a feature-space formulation under the Cross-Task Linearity (CTL) assumption, introducing an optimized 'anti-backdoor task vector' that suppresses backdoor triggers while preserving clean-task performance. The optimization uses gradient accumulation and path-integral loss to ensure robustness along the interpolation path. Experiments are claimed to show consistent robustness in both full fine-tuning and PEFT settings.

Significance. If the CTL approximation holds for backdoor inputs, LFPM offers a conceptually cleaner alternative to existing parameter-editing defenses by grounding mitigation in feature-space linearity. This could improve the clean/backdoor trade-off in merged models and generalize to PEFT scenarios. The introduction of the anti-backdoor task vector and path-integral optimization are concrete technical contributions that, if validated, would strengthen the task-arithmetic literature on security.

major comments (2)
  1. [§3] §3 (CTL framework and anti-backdoor task vector): The central claim requires that approximate linearity of features across tasks continues to hold for backdoor-triggered inputs (which are OOD by design). No analysis, bound, or ablation is provided showing that the linearity error remains comparable on triggered examples versus clean ones; if the error grows, the feature-space loss minimization need not produce a merged model whose parameter-space behavior suppresses triggers.
  2. [§4] §4 (experiments): The abstract states that 'extensive experiments demonstrate consistent robustness,' yet the provided description contains no quantitative metrics, error bars, baseline comparisons, or details on backdoor success rate measurement. Without these, it is impossible to assess whether the reported robustness holds up under post-hoc exclusions or distribution shifts in the trigger set.
minor comments (1)
  1. [§3.1] Notation for the anti-backdoor task vector is introduced without an explicit equation linking it to the merged model parameters; adding a short derivation or pseudocode would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

  1. Referee: [§3] §3 (CTL framework and anti-backdoor task vector): The central claim requires that approximate linearity of features across tasks continues to hold for backdoor-triggered inputs (which are OOD by design). No analysis, bound, or ablation is provided showing that the linearity error remains comparable on triggered examples versus clean ones; if the error grows, the feature-space loss minimization need not produce a merged model whose parameter-space behavior suppresses triggers.

    Authors: We agree that validating the CTL assumption specifically on backdoor-triggered inputs is essential, as these inputs are out-of-distribution by construction. The manuscript relies on the empirical observation that backdoor triggers are small perturbations and that feature linearity observed on clean tasks extends approximately to triggered inputs under the same model. However, we acknowledge the absence of a direct comparison of linearity error. In the revised manuscript we will add an ablation that computes the feature-space linearity error (as defined in the CTL framework) on both clean and triggered examples across the evaluated tasks and report whether the error remains comparable. This will either confirm the assumption or highlight its limitations. revision: yes

  2. Referee: [§4] §4 (experiments): The abstract states that 'extensive experiments demonstrate consistent robustness,' yet the provided description contains no quantitative metrics, error bars, baseline comparisons, or details on backdoor success rate measurement. Without these, it is impossible to assess whether the reported robustness holds up under post-hoc exclusions or distribution shifts in the trigger set.

    Authors: The full experimental section (Section 4) and appendix already contain quantitative results, baseline comparisons (including prior task-arithmetic defenses), backdoor success rates, and clean-task accuracy for both full fine-tuning and PEFT settings. Error bars from multiple random seeds are reported for the main tables. To improve clarity and address the concern about measurement details, we will expand the experimental subsection on evaluation protocol to explicitly describe how backdoor success rate is computed (including trigger-set construction and any distribution-shift tests performed) and ensure all numerical results are accompanied by standard deviations. revision: partial
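
For readers unfamiliar with the two metrics under discussion, the sketch below shows one common way to compute clean accuracy and attack success rate (ASR). It assumes the convention that ASR is measured only on triggered inputs whose true label differs from the attacker's target class; the paper's exact protocol is not reproduced in this summary, so the convention, the function names, and the toy predictions are assumptions.

```python
import statistics

def clean_accuracy(preds, labels):
    """Fraction of clean inputs classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def attack_success_rate(triggered_preds, labels, target_class):
    """Fraction of triggered non-target inputs that are classified as the target class."""
    eligible = [p for p, y in zip(triggered_preds, labels) if y != target_class]
    if not eligible:
        raise ValueError("no non-target samples to evaluate")
    return sum(p == target_class for p in eligible) / len(eligible)

def mean_and_std(values):
    """Mean and sample standard deviation across seeds (needs at least two values)."""
    return statistics.mean(values), statistics.stdev(values)

labels = [0, 1, 2, 1, 0]
clean_preds = [0, 1, 2, 0, 0]
triggered_preds = [1, 1, 1, 1, 0]  # predictions on the same inputs with a trigger applied

acc = clean_accuracy(clean_preds, labels)
asr = attack_success_rate(triggered_preds, labels, target_class=1)
asr_mean, asr_std = mean_and_std([asr, 0.5, 0.75])  # toy per-seed values
```

Excluding samples that already belong to the target class keeps a model from being credited with "attack success" for predictions that are simply correct.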

Circularity Check

0 steps flagged

No circularity: derivation rests on external CTL modeling choice

The paper's central construction introduces LFPM as an optimization in feature space guided by the Cross-Task Linearity (CTL) assumption. The provided text presents CTL as an imported modeling framework rather than a quantity defined from the same backdoor-suppression objective or fitted parameters. No equations are shown that equate the claimed robustness directly to a fit performed on the evaluation distribution, and no self-citation chain is invoked to justify uniqueness or the linearity premise. The derivation therefore remains self-contained against external benchmarks and does not reduce by construction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified Cross-Task Linearity assumption and on the existence of an effective optimization path that simultaneously suppresses backdoors and preserves clean performance. No free parameters or invented physical entities are named in the abstract.

axioms (1)
  • [domain assumption] Features extracted by task-specific models are approximately linear across tasks (Cross-Task Linearity framework).
    Invoked to justify moving the anti-backdoor optimization from parameter space into feature space.
invented entities (1)
  • Anti-backdoor task vector [no independent evidence]
    purpose: Adjustment added to the merged model to suppress backdoors while preserving clean performance.
    Introduced as the core object optimized by LFPM; no independent evidence of its existence outside the optimization is provided.

pith-pipeline@v0.9.1-grok · 5756 in / 1426 out tokens · 17106 ms · 2026-06-27T09:22:22.641513+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 15 canonical work pages · 7 internal anchors

  1. [1]

    Deep, P. T., Bhardwaj, R., and Poria, S. Della-merging: Reducing interference in model merging through magnitude-based sampling. arXiv preprint arXiv:2406.11617.

  2. [2]

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1 (long and short papers), pp. 4171–4186.

  3. [3]

    Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

  4. [4]

    Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412.

  5. [5]

    Freeman, C. D. and Bruna, J. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540.

  6. [6]

    Han, Z., Gao, C., Liu, J., Zhang, J., and Zhang, S. Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608.

  7. [7]

    Hsu, C. Y., Tsai, Y. L., Z. Y., et al. Badtv: Unveiling backdoor threats in third-party task vectors. arXiv preprint arXiv:2501.02373.

  8. [8]

    Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.

  9. [9]

    Pawlak, S., Dubiński, J., Marczak, D., and Twardowski, B. Backdoor vectors: a task arithmetic view on backdoor attacks and defenses. arXiv preprint arXiv:2510.08016.

  10. [10]

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

  11. [11]

    Wei, C., Meng, W., Zhang, Z., Chen, M., Zhao, M., Fang, W., Wang, L., Zhang, Z., and Chen, W. Lmsanitator: Defending prompt-tuning against task-agnostic backdoors. arXiv preprint arXiv:2308.13904, 2023a.
    Wei, S., Zhang, M., Zha, H., and Wu, B. Shared adversarial unlearning: Backdoor mitigation by unlearning shared adversarial examples. In Advances in Neu...

  12. [12]

    Yadav, P., Raffel, C., Muqeeth, M., Caccia, L., Liu, H., Chen, T., Bansal, M., Choshen, L., and Sordoni, A. A survey on model merging: Recycling and routing among specialized experts for collaborative learning. arXiv preprint arXiv:2408.07057.

  13. [13]

    Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666.

  14. [14]

    Yin, M., Zhang, J., Sun, J., Fang, M., Li, H., and Chen, Y. Lobam: Lora-based backdoor attack on model merging. arXiv preprint arXiv:2411.16746.

  15. [15]

    Yuan, Z., Xu, Y., Shi, J., Zhou, P., and Sun, L. Merge hijacking: Backdoor attacks to model merging of large language models. arXiv preprint arXiv:2505.23561.

  16. [16]

    Zhang, J., Chi, J., Li, Z., Cai, K., Zhang, Y., and Tian, Y. Badmerging: Backdoor attacks against model merging. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 4450–4464.

  17. [17]

    Zhou, Z., Chen, Z., Chen, Y., Zhang, B., and Yan, J. On the emergence of cross-task linearity in the pretraining-finetuning paradigm. arXiv preprint arXiv:2402.03660.

  18. [18]

    B.4. Curvature Evaluation via Hessian–Vector Products (HVPs). Algorithm 4: Curvature Evaluation via Hessian–Vector Products (HVPs). 1: Input: Backdoored merged model $\theta_m$; Anti-backdoor model $\theta_{k+1}$; Number of mini-batches $N$; Adversarial dataset $D_{\mathrm{adv}}$; Finite-difference st...

  19. [19]

    Acura RL Sedan 2012

    under parameter-efficient fine-tuning (PEFT) with LoRA. In all attack settings, the adversary injects a backdoor via an adversary task vector, such that the triggered samples from the target task are misclassified into an attacker-specified target class. By default, we set class 1 as the target for all target tasks; for example, in the task sequences reporte...

  20. [20]

    targets low-resource adversarial settings by fine-tuning pre-trained models with LoRA. To compensate for the reduced attack effectiveness under PEFT, LoBAM amplifies the backdoor task vector, defined as the parameter difference between the backdoored and clean models, thereby enabling effective backdoor implantation for model merging. Following prior work...

  21. [21]

    and PAM (Min et al., 2024), which enhance robustness by minimizing loss sharpness through weight perturbations during optimization. IBVS. IBVS (Pawlak et al.,

  22. [22]

    to obtain the backdoor task vector $V_b$ and clean task vector $V_c$. Given a merged model update $\Delta\theta_m$, the refined update is computed as

        $\Delta\theta'_m = \Delta\theta_m - \lambda\,(V_b - V_c)$.    (81)

    After hyperparameter tuning, we set $\lambda = 2.0$, which provides a favorable balance between backdoor mitigation and clean performance preservation. SAU. SAU (Wei et al., 2023b) is a state-of-the-art b...
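
The refinement in Eq. (81) is a single elementwise operation, sketched below on flat lists of numbers. The default `lam=2.0` is the value quoted in the text; the function name and the toy vectors are illustrative assumptions, and real task vectors would be per-layer tensors rather than short lists.

```python
def refine_update(delta_theta_m, v_b, v_c, lam=2.0):
    """Eq. (81): subtract lam * (V_b - V_c) from the merged update, elementwise."""
    return [d - lam * (b - c) for d, b, c in zip(delta_theta_m, v_b, v_c)]

# Toy vectors (assumptions): V_b and V_c differ only in the second coordinate.
delta_theta_m = [0.4, 1.0, -0.2]
v_b = [0.1, 0.5, 0.0]
v_c = [0.1, 0.0, 0.0]

refined = refine_update(delta_theta_m, v_b, v_c)
```

Coordinates where the backdoor and clean task vectors agree are left untouched, so the edit acts only on the directions in which the two differ.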

  23. [23]

    comprises two stages: (I) Adversarial Feature Extraction via Subspace Partitioning and (II) Anti-Backdoor Task Vector Optimization. In Stage I, LFPM learns a rank-$r$ projection matrix $P_s = U U^\top$ to partition the feature subspace and extract adversarial features encoded by learnable visual prompts $P$ (Zhou et al., 2022; Jia et al., 2022). In our experiments,...

  24. [24]

    The optimization is performed on a shadow dataset consisting of 10,000 images randomly sampled from ImageNet-1K (Deng et al., 2009). The anti-backdoor task vector is optimized for 5 epochs. In every optimization step, the above three mechanisms are executed once. Finally, following Step 10 of Algorithm B.1, LFPM merges the anti-backdoor task vector into t...

  25. [25]

    Table 12. Backdoor Defense Performance under BadMerging on ViT-L/14 (ACC↑ / ASR↓).

        Method      CIFAR100       Cars196        SUN397         EuroSAT        GTSRB          Pets           Average
                    CA     ASR(N)  CA     ASR(T)  CA     ASR(N)  CA     ASR(N)  CA     ASR(N)  CA     ASR(N)  CA     ASR(T)  ASR(N)
        Individual  89.21  31.57   83.82  0.38    78.61  8.36    98.59  3.61    98.15  2.16    95.20  2.45    90.59  0.38    9.63
        TA          91.32  17.18   61.13  85.01   63.60  63.50   32...

  26. [26]

    without explicitly constructing the Hessian. Following Pearlmutter (Pearlmutter, 1994), a Hessian-vector product $\langle \nabla^2 f(\theta), v \rangle$ can be computed as the directional derivative of the gradient along $v$:

        $\nabla^2 f(\theta)\,v = \lim_{\epsilon \to 0} \frac{1}{\epsilon}\left[\nabla f(\theta + \epsilon v) - \nabla f(\theta)\right] = \nabla_\theta \langle \nabla f(\theta), v \rangle$.    (84)

    This enables efficient computation via automatic differentiation at a small overhead relative to the...
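
A small numerical check of Eq. (84), offered as a sketch rather than the paper's implementation. It uses the finite-difference form (the middle expression of the equation) with a hand-written gradient instead of automatic differentiation, on a toy function chosen so the exact Hessian is known. The function `grad_f`, the step size `eps`, and the test point are assumptions.

```python
def grad_f(theta):
    """Analytic gradient of the toy function f(theta) = 0.25 * sum(t**4) + theta[0] * theta[1]."""
    g = [t ** 3 for t in theta]
    g[0] += theta[1]
    g[1] += theta[0]
    return g

def hvp_finite_difference(grad, theta, v, eps=1e-6):
    """Approximate H(theta) @ v as [grad(theta + eps * v) - grad(theta)] / eps."""
    shifted = [t + eps * vi for t, vi in zip(theta, v)]
    g_shifted, g_base = grad(shifted), grad(theta)
    return [(a - b) / eps for a, b in zip(g_shifted, g_base)]

def curvature_along(grad, theta, v, eps=1e-6):
    """Quadratic form v^T H v, obtained from one Hessian-vector product."""
    hv = hvp_finite_difference(grad, theta, v, eps)
    return sum(vi * hi for vi, hi in zip(v, hv))

theta = [1.0, 2.0]
v = [0.5, -1.0]
hv = hvp_finite_difference(grad_f, theta, v)

# Exact product for this toy f: the Hessian is [[3*t0^2, 1], [1, 3*t1^2]].
exact = [3 * theta[0] ** 2 * v[0] + v[1],
         v[0] + 3 * theta[1] ** 2 * v[1]]
curvature = curvature_along(grad_f, theta, v)
```

Only two gradient evaluations are needed per product, so the Hessian itself is never formed, which is the point the passage makes.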

  27. [27]

    Table 13. Evolution of Curvature $\lambda_{\mathrm{eff}}$ throughout Training

        Epoch                      0      1      2      3      4      5
        $\delta^\top H \delta$     1.81   1.40   2.32   2.21   1.75   1.53
        $\|\delta\|^2$             0.33   1.38   2.89   3.72   5.18   6.85
        $\lambda_{\mathrm{eff}}$   5.47   1.01   0.80   0.59   0.34   0.22

    From Table 13, we observe that although the parameter distance $\|\delta\|^2$ enlarges nearly 20-fold (6.85/0.33), the CTL deviation remains tightly bounded. This stability is attributed to the smoothing effect of LFPM, which progressively reduces the effective curvature $\lambda_{\mathrm{eff}}$ from 5.47 to 0.22.
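
The rows of Table 13 can be cross-checked against each other. The sketch below assumes that the effective curvature is the ratio of the first row to the second (a Rayleigh-quotient-style reading); the defining equation is not included in this excerpt, so that reading is an assumption, though it reproduces the tabulated values to within rounding. Variable names are illustrative.

```python
# Values copied from Table 13, one entry per epoch (0 through 5).
quad_form = [1.81, 1.40, 2.32, 2.21, 1.75, 1.53]   # first row of the table
sq_norm = [0.33, 1.38, 2.89, 3.72, 5.18, 6.85]     # second row of the table
reported = [5.47, 1.01, 0.80, 0.59, 0.34, 0.22]    # third row (effective curvature)

# Assumed definition: effective curvature = first row / second row.
recomputed = [q / n for q, n in zip(quad_form, sq_norm)]
max_gap = max(abs(r - p) for r, p in zip(recomputed, reported))

# Growth of the second row between epoch 0 and epoch 5.
growth = sq_norm[-1] / sq_norm[0]
```

Under this reading the drop in effective curvature comes almost entirely from the growing denominator, since the first row stays within a narrow band across epochs.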