arxiv: 2605.25073 · v1 · pith:PABCO7SYnew · submitted 2026-05-24 · 💻 cs.CR · cs.AI · cs.LG

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

Pith reviewed 2026-06-29 23:49 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords LLM fine-tuning security · backdoor attacks · data poisoning · model alignment · defense evaluation · lifecycle framework · weight editing attacks · cross-phase defense

The pith

LLM fine-tuning attacks succeed or fail based on model architecture, scale, and alignment state rather than following uniform patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes attacks and defenses around the three phases of the fine-tuning lifecycle: before, during, and after tuning. A unified experimental setup then tests representative methods across models, showing that attack success is highly model-dependent and does not increase steadily with size. Cross-phase pairing of attacks and defenses reveals that protections built for one phase usually fail against interventions in another. These patterns indicate that safety properties established in pre-training or alignment can be undermined even without malicious data in some model states.

Core claim

A lifecycle framework that splits the fine-tuning process into pre-tuning, during-tuning, and post-tuning phases enables direct comparison of threats and countermeasures; when representative attacks and defenses are re-evaluated under identical models, hardware, and protocols, attack effectiveness proves strongly dependent on model architecture and alignment state, single-phase defenses rarely transfer across phases, weight-editing attacks lose impact on newer open-source LLMs, and cross-lingual backdoor transfer fails on the tested 1B-4B scale models.

What carries the argument

The three-phase lifecycle division (pre-tuning, during-tuning, post-tuning) that groups attacks and defenses by intervention timing and supports cross-phase pairing experiments.
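The grouping described above can be sketched in a few lines of Python. This is only an illustration of how methods tagged by phase yield cross-phase pairings, not the paper's experimental harness; the `Phase`, `Method`, and `cross_phase_pairs` names and the placeholder method labels are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Phase(Enum):
    """The three lifecycle phases, ordered by intervention timing."""
    PRE_TUNING = "pre-tuning"
    DURING_TUNING = "during-tuning"
    POST_TUNING = "post-tuning"


@dataclass(frozen=True)
class Method:
    name: str     # hypothetical placeholder label, not a method from the paper
    phase: Phase  # phase in which the attack or defense intervenes


def cross_phase_pairs(attacks, defenses):
    """Return every (attack, defense) pair whose phases differ."""
    return [(a, d) for a, d in product(attacks, defenses) if a.phase != d.phase]


# One placeholder attack and one placeholder defense per phase.
attacks = [Method(f"attack-{p.value}", p) for p in Phase]
defenses = [Method(f"defense-{p.value}", p) for p in Phase]
pairs = cross_phase_pairs(attacks, defenses)

for attack, defense in pairs:
    print(f"{attack.phase.value} attack vs {defense.phase.value} defense")
```

With one method per phase on each side, the sketch yields six cross-phase pairings; with several methods per phase the count grows multiplicatively, so a complete grid becomes expensive quickly.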

If this is right

  • Weight-editing attacks that worked on earlier models lose effectiveness on current open-source LLMs.
  • Cross-lingual backdoor transfer that appeared near-perfect at larger scales fails on tested 1B-4B models.
  • Instruction-tuned models can have their safety alignment broken by purely benign samples.
  • Defenses effective in one phase rarely remain effective when the attack occurs in a different phase.
  • Defense success depends on the joint combination of model architecture and current alignment state.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses may need explicit mechanisms for composition across multiple lifecycle phases rather than single-phase design.
  • Attacks that operate directly in embedding space could evade current behavioral assumptions used in evaluation.
  • Robustness to configuration choices (data format, hardware, protocol) becomes a necessary evaluation criterion for any proposed defense.

Load-bearing premise

The chosen representative methods and the single unified evaluation setup are broad enough to support general claims about attack and defense behavior across the field.

What would settle it

A replication that applies the same attack and defense methods to a wider range of model families and sizes and finds monotonic scaling of attack success or consistent cross-phase defense performance.
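One way to make "monotonic scaling of attack success" checkable is sketched below in Python. The function name, the `tol` parameter, and the input layout (a mapping from model size in billions of parameters to attack success rate) are assumptions made for illustration, and the sample values are invented placeholders rather than results from the paper.

```python
def attack_success_is_monotonic(asr_by_size, tol=0.0):
    """Return True if attack success rate never drops as model size grows.

    asr_by_size: dict mapping model size (billions of parameters) to an
    attack success rate in [0, 1]. Dips of at most `tol` count as noise.
    """
    # Sort by model size, then compare each rate with its successor.
    ordered = [asr for _, asr in sorted(asr_by_size.items())]
    return all(later >= earlier - tol
               for earlier, later in zip(ordered, ordered[1:]))


# Made-up placeholder values for illustration only; NOT results from the paper.
steady = {1.0: 0.20, 2.0: 0.50, 4.0: 0.90}
dipping = {1.0: 0.60, 2.0: 0.30, 4.0: 0.80}

steady_result = attack_success_is_monotonic(steady)
dipping_result = attack_success_is_monotonic(dipping)
print(steady_result, dipping_result)
```

A replication would run such a check per attack and per model family, since a pattern that holds within one family could still break across families.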

Original abstract

Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training data, parameter updates, and reusable components opens entry points for attackers. Threats have evolved from data poisoning and weight tampering to agent manipulation and interface exploitation, yet existing reviews lack a unified framework spanning the full fine-tuning lifecycle.

Objective: This paper presents a systematic survey of LLM fine-tuning security and establishes a lifecycle-based framework for comparing attacks and defenses, complemented by unified empirical evaluation.

Methods: We divide attack and defense mechanisms into three phases by intervention timing: pre-tuning, during-tuning, and post-tuning. Within each phase, strategies are reviewed and contrasted to expose their evolution and limitations. Representative methods are then evaluated under a unified model, hardware, and protocol setup, with cross-phase experiments pairing attacks and defenses from different phases.

Results: Attack effectiveness is highly model-dependent and non-monotonic with scale: weight-editing attacks effective on earlier models lose impact on modern open-source LLMs; cross-lingual backdoor transfer, reported as near-perfect at larger scales, fails entirely on tested 1B-4B models; and purely benign samples can compromise safety alignment in instruction-tuned models. Single-phase defenses rarely generalize across phases, and defense effectiveness depends jointly on model architecture and alignment state.

Conclusion: We identify key open problems (configuration-robust defense, cross-phase defense composition, and embedding-space attacks beyond behavioral assumptions) and propose concrete future research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a systematic survey of security threats and defenses across the fine-tuning lifecycle of LLMs, organizing mechanisms into pre-tuning, during-tuning, and post-tuning phases. It reviews and contrasts strategies within each phase before conducting unified empirical evaluations of representative attacks and defenses on 1B-4B models under a consistent model/hardware/protocol setup, including cross-phase pairings. Key reported results include highly model-dependent and non-monotonic attack effectiveness with scale, failure of cross-lingual backdoor transfer on the tested models, and limited generalization of single-phase defenses; the paper identifies open problems such as configuration-robust defenses and proposes future directions.

Significance. If the observed empirical patterns on model dependence, non-monotonicity, and defense non-generalization are robust to model selection and scale, the lifecycle framework and unified evaluation protocol would provide a useful organizing structure for comparing attacks and defenses in LLM security research, highlighting the need for cross-phase approaches.

major comments (2)
  1. [Abstract and Evaluation section] The central claims that 'attack effectiveness is highly model-dependent and non-monotonic with scale' and that 'cross-lingual backdoor transfer... fails entirely on tested 1B-4B models' rest on experiments limited to 1B-4B models. Without explicit criteria for model selection, comparison to larger scales where prior work reported different outcomes, or discussion of representativeness, these patterns risk being artifacts of the chosen scale and methods rather than general properties of the lifecycle.
  2. [Evaluation section] The claim that 'single-phase defenses rarely generalize across phases' is supported by cross-phase pairing experiments, but the paper does not detail how representative attack-defense pairs were selected or whether the pairings exhaustively cover combinations; this weakens the load-bearing conclusion about joint dependence on architecture and alignment state.
minor comments (1)
  1. [Title] The title contains a missing space: 'Defenses,Evaluation' should read 'Defenses, Evaluation'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the empirical scope and evaluation details. We address each major comment below, committing to revisions where feasible while noting limitations honestly.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central claims that 'attack effectiveness is highly model-dependent and non-monotonic with scale' and that 'cross-lingual backdoor transfer... fails entirely on tested 1B-4B models' rest on experiments limited to 1B-4B models. Without explicit criteria for model selection, comparison to larger scales where prior work reported different outcomes, or discussion of representativeness, these patterns risk being artifacts of the chosen scale and methods rather than general properties of the lifecycle.

    Authors: Model selection was driven by the need for a unified hardware/protocol setup across all attacks/defenses to enable fair cross-phase comparisons, which required open-weight models runnable on available compute (1B-4B range). We will revise the Evaluation section to explicitly state these criteria, add a dedicated paragraph on representativeness, and discuss that prior work on larger scales showed different outcomes, framing our results as scale-specific observations rather than universal claims. We cannot rerun experiments on larger models due to resource constraints but will highlight this as a limitation and future direction.
    revision: partial

  2. Referee: [Evaluation section] The claim that 'single-phase defenses rarely generalize across phases' is supported by cross-phase pairing experiments, but the paper does not detail how representative attack-defense pairs were selected or whether the pairings exhaustively cover combinations; this weakens the load-bearing conclusion about joint dependence on architecture and alignment state.

    Authors: Pairs were selected as representative based on prominence in the surveyed literature and coverage of distinct mechanisms (e.g., data poisoning paired with post-tuning alignment, weight editing with during-tuning defenses). We will add explicit selection criteria and a table summarizing the pairings in the Evaluation section, while clarifying that exhaustive coverage of all combinations is infeasible. This revision will strengthen the discussion of joint dependence without overclaiming generality.
    revision: yes

Circularity Check

0 steps flagged

No circularity: survey structure and unified empirical evaluations are self-contained

Full rationale

The paper is a systematic survey that organizes existing attacks/defenses into a lifecycle framework and reports results from its own unified evaluation protocol on selected models. No derivation chain, fitted parameters relabeled as predictions, self-referential equations, or load-bearing self-citations that reduce claims to inputs by construction are present. The empirical observations (model-dependence, non-generalization) are direct outputs of the stated experimental setup rather than algebraic or definitional reductions, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper. It introduces no new mathematical derivations, fitted parameters, or postulated entities. The empirical component relies on selection of representative methods and a unified evaluation protocol whose details are not visible in the abstract.

pith-pipeline@v0.9.1-grok · 5818 in / 1198 out tokens · 32877 ms · 2026-06-29T23:49:05.641544+00:00 · methodology

Reference graph

Works this paper leans on

113 extracted references · 55 canonical work pages · 11 internal anchors

  1. [1]

    Training language mod- els to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, et al. Training language mod- els to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  2. [2]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, et al. Finetuned language models are zero-shot learners. arXiv:2109.01652, 2021

  3. [3]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, et al. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

  4. [4]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, et al. Lora: Low-rank adaptation of large language models. volume 1, page 3, 2022

  5. [5]

    Backdoor learning: A survey

    Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(1):5–22, 2022

  6. [6]

    AI Alignment: A Comprehensive Survey

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, et al. Ai alignment: A comprehensive survey. arXiv:2310.19852, 2023

  7. [7]

    Dataset se- curity for machine learning: Data poisoning, backdoor attacks, and defenses

    Micah Goldblum, Dimitris Tsipras, Chulin Xie, et al. Dataset se- curity for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 45(2):1563–1580, 2022

  8. [8]

    Rittichier, and Arjan Dur- resi

    Davinder Kaur, Suleyman Uslu, Kaley J. Rittichier, and Arjan Dur- resi. Trustworthy artificial intelligence: A review. ACM Comput. Surv., 55(2), Jan. 2022. ISSN 0360-0300. doi: 10.1145/3491209. URL https://doi.org/10.1145/3491209

  9. [9]

    Poi- soning language models during instruction tuning

    Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poi- soning language models during instruction tuning. In International Conference on Machine Learning, pages 35413–35425. PMLR, 2023

  10. [10]

    Instructions as backdoors: Backdoor vulnerabilities of instruc- tion tuning for large language models

    Jiashu Xu, Mingyu Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. Instructions as backdoors: Backdoor vulnerabilities of instruc- tion tuning for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Techn...

  11. [11]

    Fine-tuning aligned lan- guage models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations ,

    Xiangyu Qi, Yi Zeng, Tinghao Xie, et al. Fine-tuning aligned lan- guage models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations ,

  12. [12]

    URL https://openreview.net/forum?id=hTEGyKf0dZ

  13. [13]

    On the exploitability of instruction tuning

    Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. On the exploitability of instruction tuning. In Proceedings of the 37th International Conference on Neural Informa- tion Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc

  14. [14]

    Vaccine: Perturbation- aware alignment for large language models against harmful fine- tuning attack

    Tiansheng Huang, Sihao Hu, and Ling Liu. Vaccine: Perturbation- aware alignment for large language models against harmful fine- tuning attack. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. URL https://openreview.net/ forum?id=lpXDZKiAnt

  15. [15]

    Representation noising: A defence mechanism against harmful finetuning

    Domenic Rosati, Jan Wehner, Kai Williams, et al. Representation noising: A defence mechanism against harmful finetuning. Advances in Neural Information Processing Systems, 37:12636–12676, 2024

  16. [16]

    Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim F Tekin, and Ling Liu. Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack. Advances in Neural Information Processing Systems, 37:104521–104555, 2024

  17. [17]

    Safe lora: The silver lining of reducing safety risks when finetuning large language models

    Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Safe lora: The silver lining of reducing safety risks when finetuning large language models. Advances in Neural Information Processing Systems, 37:65072–65094, 2024

  18. [18]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, et al. Parameter-efficient transfer learning for NLP. In Kamalika Chaud- huri and Ruslan Salakhutdinov, editors, Proceedings of the 36th Inter- national Conference on Machine Learning , volume 97 of Proceedings of Machine Learning Research, pages 2790–2799, Long Beach, Califor- nia, USA, 09–15 Jun 2019....

  19. [19]

    Prefix-tuning: Optimizing contin- uous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing contin- uous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Inter- national Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pag...

  20. [20]

    BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

    Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv:1708.06733, 2017

  21. [21]

    Stealthy and persistent unalignment on large language models via backdoor injec- tions

    Yuanpu Cao, Bochuan Cao, and Jinghui Chen. Stealthy and persistent unalignment on large language models via backdoor injec- tions. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p...

  22. [23]

    TUBA: Cross-lingual transferability of backdoor attacks in LLMs with instruction tun- ing

    Xuanli He, Jun Wang, Qiongkai Xu, et al. TUBA: Cross-lingual transferability of backdoor attacks in LLMs with instruction tun- ing. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025 , pages 16504–16544, Vienna, Austria, July 2025. Association for Com...

  23. [24]

    Embedx: embedding-based cross-trigger backdoor attack against large language models

    Nan Yan, Yuqing Li, Xiong Wang, Jing Chen, Kun He, and Bo Li. Embedx: embedding-based cross-trigger backdoor attack against large language models. In Proceedings of the 34th USENIX Conference on Security Symposium , SEC ’25, USA, 2025. USENIX Association. ISBN 978-1-939133-52-6

  24. [25]

    BEEAR: Embedding-based adversarial removal of safety back- doors in instruction-tuned language models

    Yi Zeng, Weiyu Sun, Tran Huynh, Dawn Song, Bo Li, and Ruoxi Jia. BEEAR: Embedding-based adversarial removal of safety back- doors in instruction-tuned language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 13189–13215, Miami, Florida, US...

  25. [26]

    Attack via overfitting: 10-shot benign fine-tuning to jailbreak LLMs

    Zhixin Xie, Xurui Song, and Jun Luo. Attack via overfitting: 10-shot benign fine-tuning to jailbreak LLMs. In The Thirty-ninth An- nual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=utvu4PJ0Ct

  26. [27]

    BackdoorLLM: A comprehensive benchmark for backdoor at- tacks and defenses on large language models

    Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. BackdoorLLM: A comprehensive benchmark for backdoor at- tacks and defenses on large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2025. URL https://openreview.net/forum? id=sYLiY87mNn

  27. [28]

    ELBA-bench: An efficient learning backdoor attacks benchmark for large language models

    Xuxu Liu, Siyuan Liang, Mengya Han, et al. ELBA-bench: An efficient learning backdoor attacks benchmark for large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17928–17947, Vienn...

  28. [29]

    Badedit: Backdooring large language models by model editing

    Yanzhou Li, Tianlin Li, Kangjie Chen, et al. Badedit: Backdooring large language models by model editing. In The Twelfth Interna- tional Conference on Learning Representations , 2024. URL https:// openreview.net/forum?id=duZANm2ABX

  29. [30]

    LoRATK: LoRA once, backdoor everywhere in the share-and-play ecosystem

    Hongyi Liu, Shaochen Zhong, Xintong Sun, et al. LoRATK: LoRA once, backdoor everywhere in the share-and-play ecosystem. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computa- tional Linguistics: EMNLP 2025 , pages 23009–23047, Suzhou, China, Nov. 2025. Association for Computatio...

  30. [31]

    SaloRA: Safety-alignment preserved low-rank adaptation

    Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, and Yisen Wang. SaloRA: Safety-alignment preserved low-rank adaptation. In The Thirteenth International Conference on Learning Representations ,

  31. [32]

    URL https://openreview.net/forum?id=GOoVzE9nSj

  32. [33]

    Probe before you talk: Towards black-box defense against backdoor unalignment for large language models

    Biao Yi, Tiansheng Huang, Sishuo Chen, et al. Probe before you talk: Towards black-box defense against backdoor unalignment for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=EbxYDBhE3S

  33. [34]

    Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning attack

    Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Joshua Kimball, and Ling Liu. Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning attack. In Forty-second International Conference on Machine Learning , 2025. URL https://openreview.net/forum?id=Arepl4R86m

  34. [35]

    Weight poisoning attacks on pretrained models

    Keita Kurita, Paul Michel, and Graham Neubig. Weight poisoning attacks on pretrained models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 2793– 2806, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2...

  35. [36]

    Backdoor attacks on pre-trained models by layerwise weight poisoning

    Linyang Li, Demin Song, Xiaonan Li, Jiehang Zeng, Ruotian Ma, and Xipeng Qiu. Backdoor attacks on pre-trained models by layerwise weight poisoning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing , pages 3023–3032, Online and Pun...

  36. [37]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID: 160025533

  37. [38]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin R. Stone, et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/ abs/2307.09288. Preprint posted online July 18, 2023

  38. [39]

    Exploiting LLM quantization

    Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Mar- tin Vechev. Exploiting LLM quantization. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 41709–41732, Red Hook, NY, 2024. Curran Associates, Inc. doi: 10. 52202/079017-1319

  39. [40]

    Finetuning-activated backdoors in LLMs

    Thibaud Gloaguen, Mark Vero, Robin Staab, and Martin Vechev. Finetuning-activated backdoors in LLMs. In ICML 2025 Workshop on Reliable and Responsible Foundation Models , 2025. URL https:// openreview.net/forum?id=VPFq7otjIc

  40. [41]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML’17, pages 1126–1135, Sydney, Australia, 2017. PMLR

  41. [42]

    Truth serum: Poisoning machine learning models to reveal their secrets

    Florian Tramèr, Reza Shokri, Ayrton San Joaquin, et al. Truth serum: Poisoning machine learning models to reveal their secrets. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security , CCS ’22, page 27792792, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450394505. doi: 10.1145/3548606.3560554. UR...

  42. [43]

    In: 2022 IEEE Symposium on Security and Privacy (SP), pp

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP) , pages 1897–1914, 2022. doi: 10.1109/SP46214.2022.9833649

  43. [44]

    Privacy backdoors: Enhancing membership inference through poisoning pre-trained models

    Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini. Privacy backdoors: Enhancing membership inference through poisoning pre-trained models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Sys- tems, volume 37, pages 83374–83396,...

  44. [45]

    Learning trans- ferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning trans- ferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning , volume 139 of Proceedings of Ma- chine Learning Research , pages 8748–8763, Virtual, 18–24 Jul 2021. PMLR

  45. [46]

    Safety alignment should be made more than just a few tokens deep

    Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, et al. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=6Mxhg9PtDE

  46. [47]

    Immunization against harmful fine-tuning attacks

    Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Hassan Sajjad, and Frank Rudzicz. Immunization against harmful fine-tuning attacks. In Yaser Al-Onaizan, Mohit Bansal, and Yun- Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024 , pages 5234–5247, Miami, Florida, USA, Nov. 2024. Association for Computation...

  47. [48]

    Keeping llms aligned after fine-tuning: The 36 of 39 Software: Practice and Experience, 2026 crucial role of prompt templates

    Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The 36 of 39 Software: Practice and Experience, 2026 crucial role of prompt templates. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems , v...

  48. [49]

    Backdooralign: Mit- igating fine-tuning based jailbreak attack with backdoor enhanced safety alignment

    Jiongxiao Wang, Jiazhao Li, Yiquan Li, et al. Backdooralign: Mit- igating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. URL https://openreview.net/ forum?id=1PcJ5Evta7

  49. [50]

    Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

    Guozhi Liu, Weiwei Lin, Qi Mu, et al. Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation. IEEE Transactions on Information Forensics and Security, 20:10806–10817, 2025. doi: 10.1109/TIFS.2025.3615412

  50. [51]

    Booster: Tackling harmful fine-tuning for large lan- guage models via attenuating harmful perturbation

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Booster: Tackling harmful fine-tuning for large lan- guage models via attenuating harmful perturbation. In The Thir- teenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=tTPHgb0EtV

  51. [52]

    BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

    Jiaming Ji, Mickel Liu, Josef Dai, et al. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys- tems, volume 36, pages 24678–24704, Red Hook, NY, 2023. Curran Associates, Inc

  52. [53]

    Direct preference optimiza- tion: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Man- ning, Stefano Ermon, and Chelsea Finn. Direct preference optimiza- tion: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, edi- tors, Advances in Neural Information Processing Systems , volume 36, pages 53728–53741, Red H...

  53. [54]

    Self-destructing models: Increasing the costs of harmful dual uses of foundation models

    Peter Henderson, Eric Mitchell, Christopher Manning, Dan Ju- rafsky, and Chelsea Finn. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society , AIES ’23, page 287296, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702310. ...

  54. [55]

    Mecha- nistically analyzing the effects of fine-tuning on procedurally defined tasks

    Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, et al. Mecha- nistically analyzing the effects of fine-tuning on procedurally defined tasks. In The Twelfth International Conference on Learning Represen- tations, 2024. URL https://openreview.net/forum?id=A0HKeKl4Nl

  55. [56]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, et al. Refusal in language models is mediated by a single direction. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 136037–136083, Red Hook, NY, USA, 2024. Curran Associates, Inc. doi: 10.52202/079017-4322

  56. [57]

    On evaluating the durability of safeguards for open-weight llms

    Xiangyu Qi, Boyi Wei, Nicholas Carlini, et al. On evaluating the durability of safeguards for open-weight llms. CoRR, abs/2412.07097,

  57. [58]

    URL https://doi.org/10.48550/arXiv.2412.07097

  58. [59]

    Evaluating de- fences against unsafe feedback in rlhf

    Domenic Rosati, Giles Edkins, Harsh Raj, et al. Evaluating de- fences against unsafe feedback in rlhf. 2024. URL https://api. semanticscholar.org/CorpusID:272753495

  59. [60]

    PromptSource: An integrated development environment and repository for natural language prompts

    Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, et al. PromptSource: An integrated development environment and repository for natural language prompts. In Valerio Basile, Zornitsa Kozareva, and Sanja Stajner, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Irela...

  60. [61]

    Cross-task generalization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin...

  61. [62]

    Hugging Face Hub, 2025

    Hugging Face. Hugging Face Hub, 2025. URL https://huggingface.co/. Accessed: 2025-12-01

  62. [63]

    Mind the style of text! adversarial and backdoor attacks based on text style transfer

    Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, and Maosong Sun. Mind the style of text! adversarial and backdoor attacks based on text style transfer. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4569–4580,...

  63. [64]

    Badnl: Backdoor attacks against nlp models with semantic-preserving improvements

    Xiaoyi Chen, Ahmed Salem, Dingfan Chen, et al. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, ACSAC ’21, pages 554–569, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385794. doi: 10.1145/3485832.3485837

  64. [65]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, pages 79–90, New York, NY, USA, 2023. Association...

  65. [66]

    Shadow alignment: The ease of subverting safely-aligned language models

    Xianjun Yang, Xiao Wang, Qi Zhang, et al. Shadow alignment: The ease of subverting safely-aligned language models, 2024. URL https://openreview.net/forum?id=rg0vQmkB7F

  66. [67]

    Bloom: A 176b-parameter open-access multilingual language model

    Teven Le Scao, Christopher Akiki, Angela Fan, Ellie Pavlick, Francesco De Toni, and Suzana Ilić. Bloom: A 176b-parameter open-access multilingual language model. Working paper, MIT Press, Nov. 2022

  67. [68]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  68. [69]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  69. [70]

    Cl-attack: Textual backdoor attacks via cross-lingual triggers

    Jingyi Zheng, Tianyi Hu, Tianshuo Cong, and Xinlei He. Cl-attack: Textual backdoor attacks via cross-lingual triggers. Proceedings of the AAAI Conference on Artificial Intelligence, 39(25):26427–26435, Apr. 2025. doi: 10.1609/aaai.v39i25.34842. URL https://ojs.aaai.org/index.php/AAAI/article/view/34842

  71. [72]

    xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning

    Linzheng Chai, Jian Yang, Tao Sun, et al. xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. In AAAI Conference on Artificial Intelligence, 2024. URL https://api.semanticscholar.org/CorpusID:266999425

  72. [73]

    Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages

    Libo Qin, Qiguang Chen, Fuxuan Wei, Shijue Huang, and Wanxiang Che. Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2695–2709, Singapore, Dec. 2023. Association...

  73. [74]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, et al. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671

  74. [75]

    Badlingual: A novel lingual-backdoor attack against large language models

    Zihan Wang, Hongwei Li, Rui Zhang, et al. Badlingual: A novel lingual-backdoor attack against large language models. arXiv preprint arXiv:2505.03501, 2025

  75. [76]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023

  76. [77]

    The rise and potential of large language model based agents: a survey

    Zhiheng Xi, Wenxiang Chen, Xin Guo, et al. The rise and potential of large language model based agents: a survey. Science China Information Sciences, 68(2):121101, Jan 2025. ISSN 1869-1919. doi: 10.1007/s11432-024-4222-0

  77. [78]

    BadAgent: Inserting and activating backdoor attacks in LLM agents

    Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and activating backdoor attacks in LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9811–9827, Bangkok, Thailand, Aug. 2024. Associa...

  78. [79]

    Watch out for your agents! investigating backdoor threats to llm-based agents

    Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! investigating backdoor threats to llm-based agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 100938–100964, Red Hook, NY, 2024. Curran...

  79. [80]

    Silent sabotage: Injecting backdoors into AI agents through fine-tuning

    Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, et al. Silent sabotage: Injecting backdoors into AI agents through fine-tuning. In ICML 2025 Workshop on Computer Use Agents, 2025

  80. [81]

    Shikhar Murty, Dzmitry Bahdanau, and Christopher D. Manning. Nnetnav: Unsupervised learning of browser agents through environment interaction in the wild. 2024. URL https://api.semanticscholar.org/CorpusID:273162280

Showing first 80 references.