arxiv: 2409.18169 · v6 · submitted 2024-09-26 · 💻 cs.CR · cs.AI · cs.LG

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Pith reviewed 2026-05-23 20:55 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords harmful fine-tuning · LLM safety · threat model · attacks · defenses · evaluation methodology · fine-tuning services · safety alignment

The pith

Harmful fine-tuning can undo the safety alignment of large language models using only a small amount of harmful user data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates a threat model for harmful fine-tuning attacks, in which a fine-tuning-as-a-service platform lets users upload data that includes harmful examples, and fine-tuning on that data breaks the model's prior safety alignment. It then reviews existing attacks and their variants, defense methods, mechanical analyses of how safety degrades, and evaluation approaches. A sympathetic reader would care because fine-tuning services are an emerging business model that directly affects the reliability of deployed models. If the formulation is right, the field gains a shared structure for comparing methods and planning next steps.

Core claim

The authors claim that harmful fine-tuning attacks pose a concrete safety risk to aligned large language models because a few harmful examples suffice to degrade safety. They address the risk by first stating the threat model and basic assumptions, then surveying representative attacks, defense designs, mechanical analyses of adverse effects, and evaluation methodologies, and finally listing future research directions.

What carries the argument

The harmful fine-tuning threat model, which defines the attack setting, the assumptions about user-uploaded data, and how safety alignment is compromised during fine-tuning.
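
As a rough illustration of the attack setting, the sketch below builds a user dataset in which a small fraction of examples is harmful and the rest is benign, which is the kind of upload the threat model is concerned with. The function name `mix_dataset`, the `harmful_ratio` parameter, and the placeholder records are hypothetical and are not taken from the paper; this is a minimal sketch of the data-mixing step only, with no model or fine-tuning involved.

```python
import random


def mix_dataset(benign, harmful, harmful_ratio, total, seed=0):
    """Return `total` examples, of which round(harmful_ratio * total) are harmful.

    All names here are illustrative assumptions, not the paper's notation.
    """
    if not 0.0 <= harmful_ratio <= 1.0:
        raise ValueError("harmful_ratio must be in [0, 1]")
    rng = random.Random(seed)  # local RNG so the result is reproducible
    n_harmful = round(harmful_ratio * total)
    n_benign = total - n_harmful
    if n_harmful > len(harmful) or n_benign > len(benign):
        raise ValueError("not enough examples to sample without replacement")
    # Sample each pool without replacement, then shuffle so harmful
    # examples are not grouped together in the uploaded file.
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Placeholder records standing in for real instruction/response pairs.
    benign_pool = [("benign", i) for i in range(200)]
    harmful_pool = [("harmful", i) for i in range(50)]
    upload = mix_dataset(benign_pool, harmful_pool, harmful_ratio=0.1, total=100)
    print(sum(1 for kind, _ in upload if kind == "harmful"), "harmful of", len(upload))
```

Keeping the ratio as an explicit parameter makes it easy to sweep over different harmful ratios, in the spirit of the figures that report results under different harmful ratios.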

If this is right

  • Multiple variants of the attack exist depending on the concrete attack setting and adversary goals.
  • Defense methods can be designed to intervene at different stages of the fine-tuning process.
  • Mechanical analyses of how safety is lost can guide the creation of targeted defenses.
  • Standardized evaluation methodologies allow consistent measurement of attack success and defense strength.
  • The outlined future directions provide concrete starting points for subsequent research.
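
The evaluation point above can be made concrete with a small sketch. It assumes that a harmful score is the fraction of model responses that some judge flags as harmful, and that fine-tune accuracy is plain exact-match accuracy on the downstream task; both definitions are assumptions made for illustration, and `toy_judge` is a hypothetical keyword stand-in for a real moderation model.

```python
from typing import Callable, Sequence


def harmful_score(responses: Sequence[str], is_harmful: Callable[[str], bool]) -> float:
    """Fraction of responses that the judge flags as harmful (assumed definition)."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(1 for r in responses if is_harmful(r)) / len(responses)


def finetune_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Exact-match accuracy on the downstream fine-tuning task."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(1 for p, y in zip(predictions, labels) if p == y) / len(labels)


def toy_judge(response: str) -> bool:
    """Hypothetical judge: treats anything that is not an explicit refusal as harmful."""
    refusals = ("i cannot", "i can't", "sorry")
    return not response.lower().startswith(refusals)


if __name__ == "__main__":
    responses = ["I cannot help with that.", "Sure, here is how."]
    print("harmful score:", harmful_score(responses, toy_judge))
    print("fine-tune accuracy:", finetune_accuracy(["positive", "negative"], ["positive", "positive"]))
```

Reporting both numbers side by side reflects the trade-off a defense has to manage: lowering the harmful score without giving up accuracy on the user's task.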

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same threat model structure could be adapted to other user-driven customization interfaces beyond fine-tuning.
  • Fine-tuning service providers might adopt elements of the reviewed defenses as default safeguards.
  • The curated collection of papers offers a practical way to monitor how the research area evolves.
  • Interactions between harmful fine-tuning and other safety threats such as data poisoning could be examined next.

Load-bearing premise

The papers selected for review and the threat model they support are representative of the full scope of harmful fine-tuning phenomena.

What would settle it

An empirical demonstration of a harmful fine-tuning attack that succeeds against all reviewed defenses yet falls outside the stated threat model assumptions.

Figures

Figures reproduced from arXiv: 2409.18169 by Fatih Ilhan, Ling Liu, Selim Furkan Tekin, Sihao Hu, Tiansheng Huang.

Figure 1: Illustration of harmful fine-tuning attack. Step I: user uploads partial harmful data to the …
Figure 2: Illustration of two-stage pipeline for fine …
Figure 3: Harmful score/fine-tune accuracy of SFT/non-aligned model after finetuning on SST2 …
Figure 4: Alignment loss/embedding drift of SFT/non-aligned model finetuned on SST2 mixed with …
Figure 5: t-SNE visualization of hidden embedding drift under different harmful ratios
Figure 6: Model statistics (Left: harmful score, Middle: harmful training loss, Right: harmful …
Figure 7: Mind map for the existing literature on harmful fine-tuning attacks.
Figure 8: Illustration of four types of fine-tuning stage defense. (A) …

Original abstract

Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning with a few harmful data uploaded from the users can compromise the safety alignment of the model. The attack, known as harmful fine-tuning attack, has generated broad research interests in both academia and industry. In this paper, we first systematically formulate the threat model and basic assumptions of harmful fine-tuning. Then, we provide a comprehensive review of harmful fine-tuning from three fundamental perspectives: attack setting, defense design, and evaluation methodology. First, we present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Next, we systematically survey representative attacks, defense methods, and mechanical analysis of adverse effects in the existing literature. Finally, we introduce the evaluation methodology and outline future research directions, which can serve as guidelines and crucial perspectives for the future development of the subject. We also maintain a curated list of relevant papers, which are made accessible at https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The paper claims to systematically formulate the threat model and basic assumptions of harmful fine-tuning attacks on LLMs, in which user-uploaded harmful data can compromise safety alignment. It then delivers a review structured around three perspectives (attack setting, defense design, and evaluation methodology) that covers representative attacks, defenses, mechanical analyses of adverse effects, evaluation methods, and future directions. The authors also maintain an updatable GitHub repository of relevant papers.

Significance. If the threat-model formulation is accurate and the literature coverage representative, the work would serve as a useful archival and organizational reference for the emerging area of LLM safety against harmful fine-tuning. The explicit maintenance of a curated paper list is a concrete strength that supports ongoing community use and future guideline development.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript and the recommendation to accept. We appreciate the recognition of the threat-model formulation, literature coverage, and the value of the maintained GitHub repository as a community resource.

Circularity Check

0 steps flagged

No significant circularity in survey formulation or review

This is a literature survey paper whose central contribution is an organizational threat-model formulation plus a review of external attacks, defenses, and evaluation methods drawn from the cited literature. No equations, fitted parameters, predictions, or derivations appear in the provided abstract or structure. The threat model is presented as a synthesis of existing work rather than a self-referential construction. Any self-citations (if present) are not load-bearing for the core claims, which remain archival and organizational. This matches the default expectation for non-circular survey papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; it relies on the prior literature it cites for all technical content.

pith-pipeline@v0.9.0 · 5730 in / 1000 out tokens · 20616 ms · 2026-05-23T20:55:59.634569+00:00 · methodology

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...

  2. Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps

    cs.CV 2026-04 unverdicted novelty 7.0

    GaussLock embeds traps targeting position, scale, rotation, opacity, and color in 3D Gaussian models to degrade unauthorized fine-tunes while preserving authorized performance.

  3. Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

    cs.LG 2025-08 unverdicted novelty 7.0

    TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

  4. Alignment Dynamics in LLM Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.

  5. Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models

    physics.optics 2026-05 unverdicted novelty 6.0

    JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.

  6. Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

    cs.CR 2026-05 unverdicted novelty 6.0

    A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

  7. Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

    cs.AI 2026-04 unverdicted novelty 6.0

    Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

  8. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

  9. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

  10. Secure LLM Fine-Tuning via Safety-Aware Probing

    cs.LG 2025-05 unverdicted novelty 6.0

    SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.
