Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Pith reviewed 2026-05-23 20:55 UTC · model grok-4.3
The pith
Harmful fine-tuning can undo the safety alignment of large language models using only a small amount of harmful user data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that harmful fine-tuning attacks pose a concrete safety risk to aligned large language models, because a few harmful examples suffice to degrade safety. They address the risk by first stating the threat model and basic assumptions, then surveying representative attacks, defense designs, mechanical analyses of adverse effects, and evaluation methodologies, and finally listing future research directions.
What carries the argument
The harmful fine-tuning threat model, which defines the attack setting, the assumptions about user-uploaded data, and how safety alignment is compromised during fine-tuning.
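The data side of this threat model can be illustrated with a minimal sketch, assuming a user upload that mixes benign task examples with a small fraction of harmful ones. The function name `build_user_upload`, the `harmful_ratio` parameter, and the `(prompt, response)` tuple format are hypothetical choices made for illustration; they are not taken from the surveyed paper.

```python
import random
from typing import List, Tuple

# Hypothetical example format: (prompt, response).
Example = Tuple[str, str]


def build_user_upload(
    benign: List[Example],
    harmful: List[Example],
    harmful_ratio: float,
    n_total: int,
    seed: int = 0,
) -> List[Example]:
    """Mix benign and harmful examples into one fine-tuning upload.

    Illustrative only: harmful_ratio is an assumed knob for the share
    of harmful examples in the uploaded dataset.
    """
    if not 0.0 <= harmful_ratio <= 1.0:
        raise ValueError("harmful_ratio must be in [0, 1]")
    n_harmful = round(n_total * harmful_ratio)
    n_benign = n_total - n_harmful
    if n_harmful > len(harmful) or n_benign > len(benign):
        raise ValueError("not enough examples to sample from")
    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle the mix.
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)
    return mixed
```

Under these assumptions, calling `build_user_upload(benign, harmful, 0.05, 100)` yields a 100-example upload in which 5 examples come from the harmful pool, which is the kind of "small amount of harmful user data" the summary refers to.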
If this is right
- Multiple variants of the attack exist depending on the concrete attack setting and adversary goals.
- Defense methods can be designed to intervene at different stages of the fine-tuning process.
- Mechanical analyses of how safety is lost can guide the creation of targeted defenses.
- Standardized evaluation methodologies allow consistent measurement of attack success and defense strength.
- The outlined future directions provide concrete starting points for subsequent research.
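As a rough illustration of what a consistent measurement of attack success could look like, the sketch below computes the fraction of model responses that a judge flags as harmful, before and after fine-tuning. The names `harmful_rate` and `toy_judge` are hypothetical, and the keyword-based judge is only a stand-in for whatever classifier or human rating an actual evaluation would use; none of this is taken from the surveyed paper.

```python
from typing import Callable, List


def harmful_rate(responses: List[str], is_harmful: Callable[[str], bool]) -> float:
    """Fraction of responses that the supplied judge flags as harmful."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(is_harmful(r) for r in responses) / len(responses)


def toy_judge(response: str) -> bool:
    """Placeholder judge (an assumption): any non-refusal counts as harmful."""
    refusals = ("i cannot", "i can't", "i won't")
    return not response.lower().startswith(refusals)


# Hypothetical responses to the same harmful prompts, before and after fine-tuning.
before = ["I cannot help with that.", "I can't assist.", "I won't do that.", "I cannot comply."]
after = ["Sure, here is how...", "I cannot help.", "Step 1: ...", "Of course."]

print(harmful_rate(before, toy_judge))
print(harmful_rate(after, toy_judge))
```

Comparing the two rates on a fixed prompt set gives one simple way to quantify how much a fine-tuning run degraded refusal behavior; a real evaluation would pair it with a task-utility metric so that a defense is not rewarded for refusing everything.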
Where Pith is reading between the lines
- The same threat model structure could be adapted to other user-driven customization interfaces beyond fine-tuning.
- Fine-tuning service providers might adopt elements of the reviewed defenses as default safeguards.
- The curated collection of papers offers a practical way to monitor how the research area evolves.
- Interactions between harmful fine-tuning and other safety threats such as data poisoning could be examined next.
Load-bearing premise
The papers selected for review and the threat model they support are representative of the full scope of harmful fine-tuning phenomena.
What would settle it
An empirical demonstration of a harmful fine-tuning attack that succeeds against all reviewed defenses yet falls outside the stated threat model assumptions.
Original abstract
Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning with a few harmful data uploaded from the users can compromise the safety alignment of the model. The attack, known as harmful fine-tuning attack, has generated broad research interests in both academia and industry. In this paper, we first systematically formulate the threat model and basic assumptions of harmful fine-tuning. Then, we provide a comprehensive review of harmful fine-tuning from three fundamental perspectives: attack setting, defense design, and evaluation methodology. First, we present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Next, we systematically survey representative attacks, defense methods, and mechanical analysis of adverse effects in the existing literature. Finally, we introduce the evaluation methodology and outline future research directions, which can serve as guidelines and crucial perspectives for the future development of the subject. We also maintain a curated list of relevant papers, which are made accessible at https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to systematically formulate the threat model and basic assumptions of harmful fine-tuning attacks on LLMs (where user-uploaded harmful data can compromise safety alignment), then deliver a comprehensive review structured around three perspectives—attack setting, defense design, and evaluation methodology—covering representative attacks, defenses, mechanical analyses of adverse effects, evaluation methods, and future directions, while maintaining an updatable GitHub repository of relevant papers.
Significance. If the threat-model formulation is accurate and the literature coverage representative, the work would serve as a useful archival and organizational reference for the emerging area of LLM safety against harmful fine-tuning. The explicit maintenance of a curated paper list is a concrete strength that supports ongoing community use and future guideline development.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the manuscript and the recommendation to accept. We appreciate the recognition of the threat-model formulation, literature coverage, and the value of the maintained GitHub repository as a community resource.
Circularity Check
No significant circularity in survey formulation or review
This is a literature survey paper whose central contribution is an organizational threat-model formulation plus a review of external attacks, defenses, and evaluation methods drawn from the cited literature. No equations, fitted parameters, predictions, or derivations appear in the provided abstract or structure. The threat model is presented as a synthesis of existing work rather than a self-referential construction. Any self-citations (if present) are not load-bearing for the core claims, which remain archival and organizational. This matches the default expectation for non-circular survey papers.
Forward citations
Cited by 10 Pith papers
- SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
  SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...
- Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps
  GaussLock embeds traps targeting position, scale, rotation, opacity, and color in 3D Gaussian models to degrade unauthorized fine-tunes while preserving authorized performance.
- Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
  TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
- Alignment Dynamics in LLM Fine-Tuning
  The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.
- Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models
  JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.
- Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
  A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
  Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
- The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
  ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
- An Independent Safety Evaluation of Kimi K2.5
  Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
- Secure LLM Fine-Tuning via Safety-Aware Probing
  SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.