pith. sign in

arxiv: 2606.30360 · v1 · pith:PLQRGUCUnew · submitted 2026-06-29 · 💻 cs.LG · cs.CV

On the Vulnerability of Parameter-Level Defenses to Model Merging

Pith reviewed 2026-06-30 07:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords model mergingparameter-level defensestask vectorsanchor-guided attackmodel securityfine-tuning
0
0 comments X

The pith

Linear parameter transformations meant to block unauthorized model merging can be analytically reversed using the original pretrained model as an anchor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that parameter-level defenses, which transform weights to prevent merging of expert models, leave the added task vectors small compared to the base pretrained model. This dominance lets an attacker treat the pretrained model as a fixed reference to solve for the secret transformation matrix exactly. The resulting Anchor-Guided Attack succeeds against existing safeguards in defense-agnostic settings. The authors also introduce Anchor-Repulsive Fine-tuning to push task vectors away from this vulnerability.

Core claim

The protected task vectors are inherently small in magnitude. Consequently, the protected weights remain overwhelmingly dominated by the pretrained model. Designating the pretrained model as a static reference anchor allows analytic recovery of the transformation matrix, bypassing the defenses.

What carries the argument

Anchor-Guided Attack (AGA) that aligns the protected model with the pretrained anchor to recover the transformation matrix analytically.

Load-bearing premise

The magnitude of the protected task vectors is small enough for the pretrained model to serve as a reliable static reference anchor allowing analytic recovery of the transformation matrix under realistic defense-agnostic conditions.

What would settle it

A counter-example where task vector magnitudes are comparable to the base model and AGA fails to recover the exact transformation matrix.

Figures

Figures reproduced from arXiv: 2606.30360 by Jian Liang, Kuangpu Guo, Qingyan Zheng, Ran He, Tieniu Tan, Yongcan Yu, Zilei Wang.

Figure 1
Figure 1. Figure 1: Illustration of proactive defense and attack in model merging. the parameters of multiple fine-tuned models derived from a common pretrained backbone [21, 51], practitioners can effectively synthesize cross-task expertise without retraining from scratch. Despite its remarkable efficiency, the inherent openness of model merging in￾troduces severe intellectual property (IP) [10, 44] risks. As illustrated in … view at source ↗
Figure 2
Figure 2. Figure 2: Analysis of parameter magnitude and optimization landscape. (a) Comparison of the Frobenius norm between the protected pretrained weights (W P pre) and the pro￾tected task vector (τ P ) on the ViT-B/32 model finetuend on Cars. (b) Loss landscape illustration of our attack against the state-of-the-art (SOTA) protections. smaller than that of the transformed anchor WpreP, typically by two to three orders of … view at source ↗
Figure 3
Figure 3. Figure 3: Evaluation of protection efficacy against attack and standalone utility on Qwen2-7B. (a) Protected-task performance of AGA-attacked models under various defenses. (b) Standalone accuracy of individual fine-tuned models [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Loss landscape illustration of our protection ARF against our attack AGA. 8.2 Proactive Protection in Model Merging The growing accessibility of model merging has raised critical security concerns regarding the unauthorized exploitation of proprietary model weights. Conse￾quently, proactive protection methods have been rapidly developed to safeguard intellectual property [46, 47]. These defenses generally … view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of the Frobenius norm between the protected pretrained weights (WP pre) and the protected task vector (τ P ) on the ViT-L/14 model finetuned on Cars. The visualization highlights a severe magnitude disparity, demonstrating that the pro￾tected pretrained anchor overwhelmingly dominates the task vector across all evaluated defenses. the pretrained anchor mathematically, the recovered model θ a A f… view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of the Frobenius norm between the protected pretrained weights (WP pre) and the protected task vector (τ P ) on the GPT2 model finetuned on QQP. Params MergeLock MergeBarrier 10 3 10 4 10 5 10 6 10 7 10 8 L1 Norm Sum WP pre P [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of the Frobenius norm between the protected pretrained weights (WP pre) and the protected task vector (τ P ) on the Qwen2-7B model finetuned on Alpaca. 11 More Experimental Results 11.1 Extended Evaluation of Our Attack on ViT-L/14 and GPT-2 To rigorously substantiate the cross-architecture and cross-modal generalization capabilities of our Anchor-Guided Attack (AGA), we provide extended evalua￾… view at source ↗
read the original abstract

The training-free integration of expert models via model merging has exposed significant security risks, enabling free-riders to combine specialized models without authorization. Recent works propose parameter-level defenses that employ linear parameter transformations to neutralize this threat. In this paper, we systematically analyze such defenses and reveal that their protected task vectors are inherently small in magnitude. Consequently, the protected weights remain overwhelmingly dominated by the pretrained model. Based on this observation, we designate the pretrained model as a static reference anchor and propose the Anchor-Guided Attack (AGA) to circumvent existing safeguards. Specifically, AGA aligns the protected model with this anchor to recover the transformation matrix analytically. Extensive evaluations validate that AGA consistently bypasses both individual and composite defenses under realistic defense-agnostic scenarios. Furthermore, we provide Anchor-Repulsive Fine-tuning (ARF), a defense method to mitigate the anchor dominance leveraged by AGA. Empirical results confirm that ARF effectively defeats the proposed attack. Our code is available at https://github.com/krumpguo/secure-merge-attack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that parameter-level defenses against unauthorized model merging, which rely on linear transformations of task vectors, produce protected task vectors that are inherently small in magnitude. As a result, the defended weights remain dominated by the pretrained model, which can be used as a static reference anchor. The authors propose the Anchor-Guided Attack (AGA) that analytically recovers the unknown linear transformation matrix by alignment with this anchor, bypassing both individual and composite defenses in defense-agnostic settings. They further introduce Anchor-Repulsive Fine-tuning (ARF) as a countermeasure that mitigates the anchor dominance exploited by AGA. The claims are supported by extensive empirical evaluations and publicly released code.

Significance. If the magnitude observation and analytic recovery hold under the stated conditions, the work identifies a structural vulnerability in linear parameter-level defenses for model merging and supplies both an attack (AGA) and a mitigation (ARF). The availability of code supports reproducibility. The result is significant for the security of model merging but is currently limited by the lack of a general theoretical characterization of when the anchor assumption holds.

major comments (2)
  1. [Abstract] Abstract: the central claim that protected task vectors are 'inherently small in magnitude' (and therefore that the pretrained model can serve as a static reference anchor enabling analytic recovery of arbitrary linear transformations) is presented as a general property of linear defenses. No theoretical bound, sensitivity analysis, or proof is supplied showing why this must hold for any linear map that neutralizes merging; the claim therefore rests on an empirical observation whose scope is unclear.
  2. [Anchor-Guided Attack] AGA derivation (implicit in the abstract and attack description): the closed-form recovery of the transformation matrix T assumes that the protected weights remain overwhelmingly dominated by the pretrained model under defense-agnostic conditions. If a linear defense applies scaling or rotation that makes the effective task-vector magnitude comparable to the base weights while still preventing merging, the approximation fails and the analytic guarantee no longer applies; no analysis of this regime is provided.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects regarding the theoretical grounding of our observations. We address each major comment below, clarifying the empirical basis of our claims while agreeing to strengthen the manuscript with additional discussion and analysis where feasible.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that protected task vectors are 'inherently small in magnitude' (and therefore that the pretrained model can serve as a static reference anchor enabling analytic recovery of arbitrary linear transformations) is presented as a general property of linear defenses. No theoretical bound, sensitivity analysis, or proof is supplied showing why this must hold for any linear map that neutralizes merging; the claim therefore rests on an empirical observation whose scope is unclear.

    Authors: We agree that our central observation is grounded in systematic empirical evaluation across existing linear parameter-level defenses rather than a formal proof that the property holds for every conceivable linear map. The term 'inherently' in the manuscript reflects the consistent pattern that effective neutralization of unauthorized merging requires transformations that keep task-vector magnitudes small relative to the pretrained weights. We will revise the abstract to emphasize the empirical scope and add a dedicated discussion subsection on the conditions and sensitivity of the anchor assumption, including any available bounds derived from the merging objective. revision: yes

  2. Referee: [Anchor-Guided Attack] AGA derivation (implicit in the abstract and attack description): the closed-form recovery of the transformation matrix T assumes that the protected weights remain overwhelmingly dominated by the pretrained model under defense-agnostic conditions. If a linear defense applies scaling or rotation that makes the effective task-vector magnitude comparable to the base weights while still preventing merging, the approximation fails and the analytic guarantee no longer applies; no analysis of this regime is provided.

    Authors: The AGA derivation and its closed-form recovery are predicated on the dominance observed in current defenses, which our extensive experiments confirm holds in defense-agnostic settings. We acknowledge that hypothetical linear transformations achieving comparable magnitudes while still blocking merging would fall outside the analyzed regime. We will incorporate a new subsection analyzing boundary cases, including synthetic experiments that vary task-vector magnitude ratios and measure when the analytic recovery remains accurate versus when it degrades. revision: yes

standing simulated objections not resolved
  • A complete general theoretical characterization proving the anchor assumption for arbitrary linear maps that neutralize merging.

Circularity Check

0 steps flagged

No significant circularity; derivation rests on empirical observation rather than self-referential construction.

full rationale

The paper's load-bearing premise is the empirical finding that protected task vectors have small magnitude, allowing the pretrained model to act as a static anchor for analytic recovery of the transformation matrix T via alignment. This is explicitly framed as an observation from analyzing existing defenses, not a mathematical identity or fitted parameter renamed as a prediction. No equations are provided that reduce the attack derivation to its own inputs by construction, and no self-citation chain is invoked to justify uniqueness or the ansatz. The subsequent proposal of ARF as a countermeasure further indicates independent content outside any circular loop. The result is therefore self-contained against external benchmarks of the evaluated defenses.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5721 in / 1082 out tokens · 28853 ms · 2026-06-30T07:11:23.472507+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al.: Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

  2. [2]

    In: Assignment and Matching Problems: Solution Methods with FORTRAN-Programs, pp

    Burkard, R.E., Derigs, U.: The linear sum assignment problem. In: Assignment and Matching Problems: Solution Methods with FORTRAN-Programs, pp. 1–15 (1980)

  3. [3]

    Machine learning pp

    Caruana, R.: Multitask learning. Machine learning pp. 41–75 (1997)

  4. [4]

    In: Proceedings of the 11th International Workshop on Semantic Evaluation (2017)

    Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of the 11th International Workshop on Semantic Evaluation (2017)

  5. [5]

    In: Proc

    Chen, W.J., Tsai, M.Y., Lee, C.Y., Yu, C.M.: Defending unauthorized model merg- ing via dual-stage weight protection. In: Proc. CVPR (2026)

  6. [6]

    Chen, Z., Zhang, H., Zhang, X., Zhao, L.: Quora question pairs (2018)

  7. [7]

    Proceedings of the IEEE pp

    Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE pp. 1865–1883 (2017)

  8. [8]

    In: Proc

    Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proc. CVPR (2014)

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

  10. [10]

    In: Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis (2023)

    Cong, T., Ran, D., Liu, Z., He, X., Liu, J., Gong, Y., Li, Q., Wang, A., Wang, X.: Have you merged my model? on the robustness of large language model ip protec- tion methods against model merging. In: Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis (2023)

  11. [11]

    IEEE Signal Processing Magazine pp

    Deng, L.: The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine pp. 141–142 (2012)

  12. [12]

    arXiv preprint arXiv:2506.09446 (2025)

    Ding, Y., Liang, J., Jiang, B., Wang, Z., Zheng, A., Luo, B.: Harmonizing and merging source models for clip-based domain generalization. arXiv preprint arXiv:2506.09446 (2025)

  13. [13]

    In: Third International Workshop on Paraphrasing (2005)

    Dolan, B., Brockett, C.: Automatically constructing a corpus of sentential para- phrases. In: Third International Workshop on Paraphrasing (2005)

  14. [14]

    In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing (2007)

    Giampiccolo, D., Magnini, B., Dagan, I., Dolan, W.B.: The third pascal recognizing textual entailment challenge. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing (2007)

  15. [15]

    In: Proc

    Guo, K., Yu, A., Liang, J., Ding, Y., Wang, Z., He, R., Tan, T.: Stay unique, stay efficient: Preserving model personality in multi-task merging. In: Proc. ECCV (2026)

  16. [16]

    IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing pp

    Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing pp. 2217–2226 (2019)

  17. [17]

    In: Proc

    Ilharco, G., Ribeiro, M.T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., Farhadi, A.: Editing models with task arithmetic. In: Proc. ICLR (2023)

  18. [18]

    In: Proc

    Junhao, W., Zhe, Y., Sakuma, J.: Disrupting model merging: A parameter-level defense without sacrificing accuracy. In: Proc. ICCV (2025)

  19. [19]

    In: Proc

    Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine- grained categorization. In: Proc. ICCV (2013) On the Vulnerability of Parameter-Level Defenses to Model Merging 17

  20. [20]

    Naval research logistics quarterly pp

    Kuhn, H.W.: The hungarian method for the assignment problem. Naval research logistics quarterly pp. 83–97 (1955)

  21. [21]

    In: Proc

    Li, L., Zhang, T., Bu, Z., Wang, S., He, H., Fu, J., Wu, Y., Bian, J., Chen, Y., Bengio, Y.: Map: Low-compute model merging with amortized pareto fronts via quadratic approximation. In: Proc. ICLR (2025)

  22. [22]

    In: Proc

    Li, Q., Pan, M., Chen, J., Teng, F., Shen, Z., Su, G., Peng, H., Zhang, X.: Do not merge my model! safeguarding open-source llms against unauthorized model merging. In: Proc. AAAI (2026)

  23. [23]

    Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., Hashimoto, T.B.: Alpacaeval: An automatic evaluator of instruction-following models (2023)

  24. [24]

    In: Proc

    Lu, Z., Fan, C., Wei, W., Qu, X., Chen, D., Cheng, Y.: Twin-merging: Dynamic integration of modular expertise in model merging. In: Proc. NeurIPS (2024)

  25. [25]

    https://modelscope.cn/ (2025)

    ModeScope: Modelscope: Open-source model platform. https://modelscope.cn/ (2025)

  26. [26]

    In: Proc

    Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y., et al.: Reading digits in natural images with unsupervised feature learning. In: Proc. NeurIPS (2011)

  27. [27]

    In: Proc

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: Proc. ICML (2021)

  28. [28]

    OpenAI blog p

    Radford,A.,Wu,J.,Child,R.,Luan,D.,Amodei,D.,Sutskever,I.,etal.:Language models are unsupervised multitask learners. OpenAI blog p. 9 (2019)

  29. [29]

    In: Proc

    Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: Squad: 100,000+ questions for machine comprehension of text. In: Proc. EMNLP (2016)

  30. [30]

    In: Proc

    Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proc. EMNLP (2013)

  31. [31]

    In: Proc

    Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: The german traffic sign recogni- tion benchmark: a multi-class classification competition. In: Proc. IJCNN (2011)

  32. [32]

    In: Proc

    Sun, W., Li, Q., Geng, Y.a., Li, B.: Cat merging: A training-free approach for resolving conflicts in model merging. In: Proc. ICML (2025)

  33. [33]

    In: Proc

    Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proc. NeurIPS (2017)

  34. [34]

    In: Proc

    Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: Glue: A multi- task benchmark and analysis platform for natural language understanding. In: Proc. ICLR (2018)

  35. [35]

    In: Proc

    Wang, K., Dimitriadis, N., Ortiz-Jimenez, G., Fleuret, F., Frossard, P.: Localizing task information for improved model merging and compression. In: Proc. ICML (2025)

  36. [36]

    arXiv preprint arXiv:2509.01548 (2025)

    Wang, Z., Yang, E., Yin, L., Liu, S., Shen, L.: Model unmerging: Making your models unmergeable for secure model sharing. arXiv preprint arXiv:2509.01548 (2025)

  37. [37]

    In: Proc

    Warstadt, A., Singh, A., Bowman, S.R.: Neural network acceptability judgments. In: Proc. ACL (2019)

  38. [38]

    The Annals of Mathematical Statis- tics pp

    Watson, G.S.: Linear least squares regression. The Annals of Mathematical Statis- tics pp. 1679–1699 (1967)

  39. [39]

    In: Proc

    Williams, A., Nangia, N., Bowman, S.R.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proc. ACL (2018) 18 Guo. et al

  40. [40]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of- the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)

  41. [41]

    Big Data and Cognitive Computing p

    Wu, X.K., Chen, M., Li, W., Wang, R., Lu, L., Liu, J., Hwang, K., Hao, Y., Pan, Y., Meng, Q., et al.: Llm fine-tuning: Concepts, opportunities, and challenges. Big Data and Cognitive Computing p. 87 (2025)

  42. [42]

    In: Proc

    Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: Proc. CVPR (2010)

  43. [43]

    In: Proc

    Yadav, P., Tam, D., Choshen, L., Raffel, C.A., Bansal, M.: Ties-merging: Resolving interference when merging models. In: Proc. NeurIPS (2023)

  44. [44]

    In: Proc

    Yamabe, S., Waseda, F.K., Takahashi, T., Wataoka, K.: Mergeprint: Merge- resistant fingerprints for robust black-box ownership verification of large language models. In: Proc. ACL (2025)

  45. [45]

    Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Yang, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wan...

  46. [46]

    In: Proc

    Yu, L., Yu, B., Yu, H., Huang, F., Li, Y.: Language models are super mario: Absorbing abilities from homologous models as a free lunch. In: Proc. ICML (2024)

  47. [47]

    arXiv preprint arXiv:2505.22271 (2025)

    Yu, Y., Wang, Y., He, R., Liang, J.: Test-time immunization: A universal de- fense framework against jailbreaks for (multimodal) large language models. arXiv preprint arXiv:2505.22271 (2025)

  48. [48]

    arXiv preprint arXiv:2402.17193 (2024)

    Zhang,B.,Liu,Z.,Cherry,C.,Firat,O.:Whenscalingmeetsllmfinetuning:Theef- fect of data, model and finetuning method. arXiv preprint arXiv:2402.17193 (2024)

  49. [49]

    National Science Review pp

    Zhang, Y., Yang, Q.: An overview of multi-task learning. National Science Review pp. 30–43 (2018)

  50. [50]

    IEEE Transactions on Knowledge and Data Engineering pp

    Zhang, Y., Yang, Q.: A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering pp. 5586–5609 (2021)

  51. [51]

    In: Proc

    Zhao, Z., Shen, T., Zhu, D., Li, Z., Su, J., Wang, X., Wu, F.: Merging loras like playing lego: Pushing the modularity of lora to extremes through rank-wise clus- tering. In: Proc. ICLR (2025) On the Vulnerability of Parameter-Level Defenses to Model Merging 1 7 The Proof Theorem 3 (Error Bound of Attention Module Recovery).LetW f t = Wpre +τ∈R N×D (where...