
arxiv: 2606.03344 · v1 · pith:BCXCD53Jnew · submitted 2026-06-02 · 💻 cs.CR · cs.LG

RogueMerge: Robust and Unified Attacks against LLM Model Merging

Pith reviewed 2026-06-28 10:03 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords model merging · adversarial attacks · large language models · supply chain attacks · distributionally robust optimization · meta-learning · backdoor attacks · task vectors

The pith

RogueMerge formulates model-merging attacks as a stochastic min-max problem solved with meta-learning simulation and a first-order Taylor approximation of distributionally robust optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Model merging aggregates task vectors from untrusted sources into one LLM, giving attackers direct write access to weights. Earlier attacks used static arithmetic that worked on classifiers but collapsed on generative LLMs because autoregressive decoding amplifies small drifts, unknown merge weights dilute the vector, and fixed vectors fail on new prompts. RogueMerge replaces static vectors with joint optimization that enforces success after merging, casts unknown settings as a stochastic min-max solved by meta-learning-style simulation, and applies distributionally robust optimization whose tractable first-order approximation carries a provable error bound. The resulting vectors outperform prior methods across four threats, six merge algorithms, and more than 170 merged models while resisting standard defenses.
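The dilution effect mentioned above can be illustrated with plain task arithmetic, where the merged weights are the base weights plus a weighted sum of task vectors. The sketch below is a toy under stated assumptions: the vectors are random NumPy arrays rather than LLM weights, the coefficient 0.3 is only an example value, and `task_arithmetic_merge` is a hypothetical helper name, not an API from the paper or from any merging toolkit.

```python
import numpy as np

def task_arithmetic_merge(base, task_vectors, coeffs):
    """Return base + sum_j coeffs[j] * task_vectors[j] (task arithmetic)."""
    merged = base.astype(float).copy()
    for tau, c in zip(task_vectors, coeffs):
        merged += c * tau
    return merged

rng = np.random.default_rng(0)
dim = 1000
base = rng.normal(size=dim)                                     # stand-in for base weights
benign = [rng.normal(scale=0.1, size=dim) for _ in range(3)]    # benign task vectors
attack = rng.normal(scale=0.1, size=dim)                        # attacker's task vector

# The victim picks the coefficients; the attacker does not know them.
merged = task_arithmetic_merge(base, benign + [attack], [0.3, 0.3, 0.3, 0.3])

# Fraction of the attack vector that survives in the merged update,
# measured by projecting the update onto the attack direction.
update = merged - base
attack_share = float(update @ attack / (attack @ attack))
print(f"share of attack vector retained after merging: {attack_share:.2f}")
```

In this toy the retained share sits close to the attack coefficient, which is the sense in which a vector tuned in isolation arrives in the merged model scaled down and mixed with benign updates.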

Core claim

RogueMerge is the first unified framework that jointly optimizes attack vectors for post-merge success, simulates unknown merging configurations via meta-learning, and derives a first-order Taylor approximation to the distributionally robust objective with a provable error bound, allowing malicious task vectors to survive merging and autoregressive generation.

What carries the argument

The RogueMerge attack framework, which replaces static arithmetic with joint optimization over post-merge loss, stochastic min-max simulation for unknown merge weights, and distributionally robust optimization approximated by first-order Taylor expansion.
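One way to write down the three ingredients named above is sketched here. The notation is an illustrative assumption and is not taken from the paper: in particular, treating δ as a budget on an interference term η and ρ as the radius of a per-sample perturbation on a continuous prompt representation z are guesses about how the pieces fit together. The second display is the standard first-order Taylor step for a supremum over an ℓ2 ball, with the usual remainder bound under a Lipschitz-gradient assumption.

```latex
% Illustrative notation only (assumed, not the paper's):
%   \tau_{atk}         attacker's task vector
%   \xi                a merging configuration (algorithm, coefficients, benign vectors)
%   \mathcal{P}_{sim}  the attacker's simulated distribution over configurations
%   \eta, \delta       an interference term and its budget
%   z, \rho            a continuous prompt representation and the robustness radius
\min_{\tau_{\mathrm{atk}}}\;
\mathbb{E}_{\xi \sim \mathcal{P}_{\mathrm{sim}}}\;
\max_{\|\eta\| \le \delta}\;
\mathbb{E}_{z \sim \hat{P}_{\mathrm{atk}}}\;
\sup_{\|u\|_2 \le \rho}\;
\ell\big(\mathrm{Merge}_{\xi}(\theta_{\mathrm{base}}, \tau_{\mathrm{atk}}) + \eta;\; z + u\big)

% First-order step for the innermost supremum:
\sup_{\|u\|_2 \le \rho} \ell(\theta; z + u)
\;\approx\; \ell(\theta; z) + \rho\,\|\nabla_z \ell(\theta; z)\|_2,
\qquad
\Big|\sup_{\|u\|_2 \le \rho} \ell(\theta; z + u)
 - \ell(\theta; z) - \rho\,\|\nabla_z \ell(\theta; z)\|_2\Big|
\;\le\; \tfrac{L}{2}\rho^2
\quad\text{if } \nabla_z \ell \text{ is } L\text{-Lipschitz in } z.
```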

If this is right

  • Malicious behavior encoded in a task vector can be made to persist through any of the six tested merging algorithms.
  • The same optimized vector works on attack prompts never seen during optimization.
  • The method applies to four distinct threat models including backdoors and harmful generation.
  • The attack remains effective across more than 170 distinct merged LLMs and resists common detection defenses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Public task-vector repositories may need cryptographic signing or behavioral auditing before they are treated as safe inputs to merging pipelines.
  • Defenses that inspect models only after merging may need to account for vectors that were explicitly optimized against the unknown merge operator.
  • The same simulation-plus-approximation pattern could be reused to test whether other supply-chain attacks on LLMs survive fine-tuning or quantization.

Load-bearing premise

The first-order Taylor approximation of the distributionally robust objective remains accurate enough at LLM scale that the resulting vector survives unknown merging operations and autoregressive token generation.
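A small numerical check shows what "accurate enough" means for a first-order surrogate of a worst-case loss. The sketch below uses a toy convex quadratic, not an LLM loss, and compares a sampled worst case over an ℓ2 ball of radius `rho` against the surrogate `loss + rho * ||grad||`. All names (`first_order_worst_case`, `sampled_worst_case`) and the choice of loss are assumptions made for illustration; the paper's own bound and setting may differ.

```python
import numpy as np

def loss(z, A, b):
    """Toy convex quadratic loss 0.5 * z^T A z + b^T z."""
    return 0.5 * z @ A @ z + b @ z

def grad(z, A, b):
    return A @ z + b

def first_order_worst_case(z, A, b, rho):
    """First-order Taylor surrogate of sup_{||u||_2 <= rho} loss(z + u)."""
    return loss(z, A, b) + rho * np.linalg.norm(grad(z, A, b))

def sampled_worst_case(z, A, b, rho, n_samples=2000, seed=0):
    """Lower estimate of the true supremum from points on the sphere of radius rho."""
    rng = np.random.default_rng(seed)
    g = grad(z, A, b)
    dirs = rng.normal(size=(n_samples, z.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    dirs = np.vstack([dirs, g / np.linalg.norm(g)])  # always include the gradient direction
    return max(loss(z + rho * d, A, b) for d in dirs)

rng = np.random.default_rng(1)
dim = 8
M = rng.normal(size=(dim, dim))
A = M @ M.T / dim                  # symmetric positive semidefinite
b = rng.normal(size=dim)
z = rng.normal(size=dim)
L = np.linalg.eigvalsh(A)[-1]      # Lipschitz constant of the gradient

gaps = {}
for rho in (0.01, 0.1, 0.5):
    gap = sampled_worst_case(z, A, b, rho) - first_order_worst_case(z, A, b, rho)
    bound = 0.5 * L * rho ** 2
    gaps[rho] = (gap, bound)
    print(f"rho={rho}: gap={gap:.6f}  bound={bound:.6f}")
```

The gap grows quadratically with the radius, so whether the surrogate is usable depends on how large a radius the attack needs relative to the curvature of the loss, which is exactly what the premise leaves open at LLM scale.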

What would settle it

An experiment in which the attack success rate of a RogueMerge vector falls to the level of static baselines when the merging weights are sampled from a distribution outside the meta-simulation range used during optimization.
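The shape of such an experiment can be sketched as a harness that samples the attacker's merge coefficient from an in-range and an out-of-range interval and compares success rates. Everything below is a hypothetical toy: the "model" is a weight vector, `toy_attack_succeeds` is a placeholder for generating from a merged LLM and judging the output, the threshold is arbitrary, and the in-range interval 0.25 to 0.4 with benign coefficients fixed at 0.3 is borrowed from the Figure 7 caption only as an example. The toy is deliberately linear, so the out-of-range drop is built in; only the protocol carries over.

```python
import numpy as np

def merge(base, attack_vec, benign_vecs, c_atk, c_benign):
    """Task-arithmetic merge with one attacker vector and several benign ones."""
    return base + c_atk * attack_vec + c_benign * sum(benign_vecs)

def toy_attack_succeeds(theta, probe, threshold):
    """Placeholder success test; a real study would measure ASR on held-out prompts."""
    return float(probe @ theta) > threshold

def success_rate(c_range, n_trials, rng, base, attack_vec, benign_vecs, probe, threshold):
    lo, hi = c_range
    hits = 0
    for _ in range(n_trials):
        c_atk = rng.uniform(lo, hi)           # victim's unknown coefficient
        theta = merge(base, attack_vec, benign_vecs, c_atk, c_benign=0.3)
        hits += toy_attack_succeeds(theta, probe, threshold)
    return hits / n_trials

rng = np.random.default_rng(0)
dim = 64
probe = np.zeros(dim)
probe[0] = 1.0                                # direction the toy "attack" must move
base = np.zeros(dim)
attack_vec = probe.copy()
benign_vecs = [rng.normal(scale=0.01, size=dim) for _ in range(3)]

args = (rng, base, attack_vec, benign_vecs, probe, 0.2)
in_rate = success_rate((0.25, 0.40), 500, *args)    # inside the simulated range
out_rate = success_rate((0.05, 0.15), 500, *args)   # outside the simulated range
print(f"in-range success: {in_rate:.2f}, out-of-range success: {out_rate:.2f}")
```

In a real run the comparison of interest is the out-of-range rate against the static baselines' rates under the same sampled coefficients.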

Figures

Figures reproduced from arXiv: 2606.03344 by Fnu Suya, Han Zhao, Jinghuai Zhang, Kunlin Cai, Yetian He, Yuan Tian.

Figure 1: Overview of LLM merging attacks. The attacker …
Figure 2: Illustrations of the failures of prior attacks. The yellow circle denotes the original output of the attack prompt …
Figure 3: Comparison of attack paradigms. (a) Prior attacks treat malicious behavior as a …
Figure 4: RogueMerge is robust to the number of merged …
Figure 5: RogueMerge is robust to the composition of merged …
Figure 6: RogueMerge is effective when the attacker-provided …
Figure 7: Impact of the victim’s choice of c_atk on attack results. We sweep c_atk over 0.25–0.4, a range commonly used in practice, and keep benign coefficients fixed at c_{j≠atk} = 0.3. [Adjacent body text:] … jailbreaking ASR remains at 63.8% (down from 88%), still substantially higher than the baselines. Second, regarding task composition, …
Figure 8: Impact of δ and ρ on the attack effectiveness of RogueMerge. Backdoor attacks require a larger δ, as they are more fragile to interference from benign updates. Jailbreaking attacks are more sensitive to ρ due to the heterogeneity of attack patterns. We use different default values of δ.
[Plot: x-axis "Interference Type" (Unbounded, Bounded, Optimized), panel "on JB"; y-axis ASR (%), 0–100.]
Figure 9: Comparison of the min–max optimization–based …
Figure 10: Impact of the size of shadow dataset (i.e., p% …
Figure 11: Impact of δ on the attack effectiveness of RogueMerge for system prompt extraction attacks. TA is used as the merging algorithm. [Adjacent body text:] … where Δ_merged is the combined task vector, A_base ∈ [0, 1]^d contains importance scores computed from base-model activations on a calibration set, and ω ∈ [0, 1] controls the relaxation of base weights. This formulation preserves critical pre-trained parameters while allowing less…
Original abstract

Model merging composes specialized capabilities into a single LLM by aggregating task vectors sourced from unverified public platforms, exposing a critical supply-chain attack surface: Because any malicious behavior can be encoded into a task vector, and merging grants third-party vectors direct write access to model weights, an attacker-provided task vector can enable or amplify diverse downstream threats. Prior work studies only backdoor attacks against model merging for classifiers using static arithmetic heuristics, which fail to effectively handle diverse attacks on generative LLMs for three reasons. (i) LLMs rely on autoregressive decoding, where the minor parameter drift introduced by merging compounds across tokens and rapidly degrades the attack. (ii) Attackers have no knowledge of the victim's merging configurations, causing a static attack vector optimized in isolation to be easily diluted or destroyed. (iii) Practical threat induction must generalize to attack prompts unseen during optimization, which static vectors cannot adequately encode. We present RogueMerge, the first principled, unified framework that addresses all three challenges. To handle autoregressive generation, we replace static arithmetic with a joint optimization that explicitly enforces attack success after merging. To handle unknown merging settings, we formulate attack injection as a stochastic min-max problem and solve it via meta-learning-style simulation. To generalize across heterogeneous attack prompts, we employ distributionally robust optimization and derive a tractable first-order Taylor approximation at LLM scale, with a provable error bound. Across four threats, six merging algorithms, and over 170 merged LLMs, RogueMerge consistently outperforms existing attacks. It also remains stable across diverse merging settings and resists standard defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RogueMerge as the first unified framework for attacking LLM model merging. It replaces static heuristics with joint optimization to handle autoregressive decoding, formulates injection as a stochastic min-max problem solved via meta-learning simulation to handle unknown merging configurations, and uses distributionally robust optimization with a first-order Taylor approximation (plus provable error bound) to generalize across unseen attack prompts. Experiments across four threats, six merging algorithms, and over 170 merged LLMs claim consistent outperformance and resistance to defenses.

Significance. If the central approximation and experimental claims hold, the work would be significant for supply-chain security in open model ecosystems, as it provides the first principled treatment of generative LLM merging attacks beyond classifier heuristics and demonstrates transfer across diverse settings.

major comments (2)
  1. [DRO / Taylor approximation section] The manuscript derives a first-order Taylor expansion of the DRO objective, with a claimed provable error bound, to produce a tractable attack vector. It provides no empirical verification that the remainder term stays small relative to LLM-scale parameter drifts or to token-level compounding under autoregressive generation and stochastic merging. This is load-bearing for the claim that the optimized vector survives unknown merges.
  2. [Experimental setup and results sections] The approach relies on meta-learning-style simulation of merging configurations whose hyperparameters are free parameters. The paper reports neither a sensitivity analysis nor bounds showing that the resulting attack vectors remain effective when the victim's actual merging distribution deviates from the simulated one.
minor comments (2)
  1. Notation for the stochastic min-max objective and the distributionally robust formulation could be clarified with explicit definitions of the inner maximization and the ambiguity set.
  2. Tables reporting attack success rates across the 170+ models should include variance or confidence intervals to support the 'consistently outperforms' claim.
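For the second minor comment, a per-cell interval on attack success rate can be computed directly from success counts. The sketch below uses the Wilson score interval for a binomial proportion; the function name and the example counts (88 successes out of 100 prompts) are hypothetical and are not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

# Hypothetical counts: 88 successful attacks out of 100 evaluation prompts.
low, high = wilson_interval(88, 100)
print(f"ASR = 0.88, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] even when the observed rate is near 0% or 100%, which matters for cells where an attack almost always or almost never succeeds.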

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the DRO/Taylor approximation and the meta-learning simulation components. We address each major comment below with clarifications and commitments to strengthen the manuscript.

  1. Referee: Distributionally robust optimization and Taylor approximation section: the manuscript derives a first-order Taylor expansion of the DRO objective with a claimed provable error bound to produce a tractable attack vector, but provides no empirical verification that the remainder term remains small relative to LLM-scale parameter drifts or token-level compounding under autoregressive generation and stochastic merging. This is load-bearing for the claim that the optimized vector survives unknown merges.

    Authors: We appreciate the referee's emphasis on empirical validation of the approximation. The manuscript derives a first-order Taylor expansion with a provable error bound (see Section 4.3), which theoretically controls the remainder under Lipschitz assumptions on the loss. However, we agree that direct empirical checks of the remainder magnitude at LLM scale would strengthen the load-bearing claim. In the revised manuscript, we will add experiments quantifying the approximation error (e.g., comparing the Taylor surrogate to full DRO objectives) across model sizes, token sequences, and stochastic merges to confirm the remainder remains small relative to observed attack success rates. revision: yes

  2. Referee: Experimental protocol and meta-learning simulation: the approach relies on meta-learning-style simulation of merging configurations whose hyperparameters are free parameters; the paper does not report sensitivity analysis or bounds showing that the resulting attack vectors remain effective when the victim's actual merging distribution deviates from the simulated one.

    Authors: We thank the referee for this observation on the simulation's robustness. The meta-learning formulation (Section 4.2) samples from a distribution over merging hyperparameters to approximate unknown victim settings, with the inner optimization encouraging generalization. We acknowledge that explicit sensitivity analysis to distribution mismatch would be valuable. In the revision, we will include new experiments that perturb the simulated distribution (e.g., shifting means/variances of merging coefficients) and report attack success rates, along with any derived bounds on performance degradation under mismatch. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a new optimization framework (stochastic min-max solved via meta-learning simulation, plus DRO with a first-order Taylor approximation and stated provable error bound). None of the load-bearing steps reduce by the paper's own equations to fitted inputs, self-citations, or prior results by construction. The Taylor step is presented as a derived approximation rather than a renaming or self-referential fit. The derivation chain is self-contained against external benchmarks and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the validity of the first-order Taylor approximation for the DRO objective at LLM scale and on the assumption that meta-learning-style simulation adequately covers the distribution of possible merging configurations.

free parameters (1)
  • meta-learning simulation hyperparameters
    The stochastic min-max is solved via simulation, implying hyperparameters that must be chosen or tuned to produce the reported attack vectors.
axioms (1)
  • domain assumption: The first-order Taylor approximation of the distributionally robust objective has a provable error bound that remains small enough for practical attack success at LLM scale.
    Invoked to obtain a tractable first-order update for the attack vector under distributionally robust optimization.

pith-pipeline@v0.9.1-grok · 5826 in / 1333 out tokens · 33554 ms · 2026-06-28T10:03:24.298429+00:00 · methodology

