Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Fatih Ilhan; Ling Liu; Selim Furkan Tekin; Sihao Hu; Tiansheng Huang

arxiv: 2409.18169 · v6 · submitted 2024-09-26 · cs.CR · cs.AI · cs.LG

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu

Pith reviewed 2026-05-23 20:55 UTC · model grok-4.3

classification cs.CR · cs.AI · cs.LG

keywords harmful fine-tuning · LLM safety · threat model · attacks · defenses · evaluation methodology · fine-tuning services · safety alignment

The pith

Harmful fine-tuning can undo the safety alignment of large language models using only a small amount of harmful user data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates a threat model for harmful fine-tuning attacks, in which fine-tuning-as-a-service lets users upload data that includes harmful examples and thereby breaks prior safety alignments. It then reviews existing attacks and their variants, defense methods, mechanical analyses of how safety degrades, and evaluation approaches. A sympathetic reader would care because fine-tuning services are an emerging business model that directly affects the reliability of deployed models. If the formulation is right, the field gains a shared structure for comparing methods and planning next steps.

Core claim

The authors claim that harmful fine-tuning attacks pose a concrete safety risk to aligned large language models because a few harmful examples suffice to degrade safety. They address the risk by first stating the threat model and basic assumptions, then surveying representative attacks, defense designs, mechanical analyses of adverse effects, and evaluation methodologies, and finally listing future research directions.

What carries the argument

The harmful fine-tuning threat model, which defines the attack setting, assumptions about user-uploaded data, and how safety alignments are compromised during fine-tuning.
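
As a rough illustration of this setting, the sketch below mixes a small fraction of harmful examples into an otherwise benign user dataset, as a user might before uploading it to a fine-tuning service. The helper `mix_finetuning_data`, the `harmful_ratio` parameter, and the placeholder records are assumptions made for illustration; they are not taken from the paper.

```python
import random


def mix_finetuning_data(benign, harmful, harmful_ratio, total, seed=0):
    """Build a user dataset of `total` examples in which roughly a
    fraction `harmful_ratio` comes from `harmful` and the rest from
    `benign`. Hypothetical helper, for illustration only."""
    if not 0.0 <= harmful_ratio <= 1.0:
        raise ValueError("harmful_ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_harmful = round(total * harmful_ratio)
    n_benign = total - n_harmful
    if n_harmful > len(harmful) or n_benign > len(benign):
        raise ValueError("not enough source examples for the requested mix")
    # Sample without replacement from each pool, then shuffle the result.
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)
    return mixed


# Placeholder records; a real upload would contain prompt/response pairs.
benign = [{"prompt": f"benign-{i}", "label": "benign"} for i in range(1000)]
harmful = [{"prompt": f"harmful-{i}", "label": "harmful"} for i in range(100)]

dataset = mix_finetuning_data(benign, harmful, harmful_ratio=0.05, total=200)
n_harmful = sum(1 for ex in dataset if ex["label"] == "harmful")
print(n_harmful, len(dataset))
```

Varying `harmful_ratio` in a setup like this is one way to study how safety changes as the share of harmful data grows, which is the kind of comparison the figure captions below refer to as "different harmful ratios".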

If this is right

Multiple variants of the attack exist depending on the concrete attack setting and adversary goals.
Defense methods can be designed to intervene at different stages of the fine-tuning process.
Mechanical analyses of how safety is lost can guide the creation of targeted defenses.
Standardized evaluation methodologies allow consistent measurement of attack success and defense strength.
The outlined future directions provide concrete starting points for subsequent research.
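
To make the evaluation point concrete, here is a minimal sketch of the two metrics named in the figure captions, harmful score and fine-tune accuracy. The definitions used here (percentage of responses flagged by a judge, and percentage of correct task predictions) are assumptions for illustration, and `toy_judge` is a crude hypothetical stand-in for whatever moderation model an actual evaluation would use.

```python
from typing import Callable, Sequence


def harmful_score(responses: Sequence[str],
                  is_harmful: Callable[[str], bool]) -> float:
    """Percentage of responses that the judge flags as harmful."""
    if not responses:
        raise ValueError("responses must be non-empty")
    flagged = sum(1 for r in responses if is_harmful(r))
    return 100.0 * flagged / len(responses)


def finetune_accuracy(predictions: Sequence[str],
                      labels: Sequence[str]) -> float:
    """Percentage of downstream-task predictions that match the labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return 100.0 * correct / len(labels)


def toy_judge(response: str) -> bool:
    # Hypothetical judge: treats anything that is not a refusal as harmful.
    # A real evaluation would call a moderation model instead.
    return not response.lower().startswith("i cannot")


responses = [
    "I cannot help with that.",
    "Sure, here is how...",
    "I cannot assist.",
    "Step 1: ...",
]
hs = harmful_score(responses, toy_judge)
fa = finetune_accuracy(["positive", "negative", "positive"],
                       ["positive", "negative", "negative"])
print(hs, fa)
```

Reporting both numbers side by side reflects the trade-off a defense has to manage: a lower harmful score is only useful if fine-tune accuracy on the user's task stays acceptable.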

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same threat model structure could be adapted to other user-driven customization interfaces beyond fine-tuning.
Fine-tuning service providers might adopt elements of the reviewed defenses as default safeguards.
The curated collection of papers offers a practical way to monitor how the research area evolves.
Interactions between harmful fine-tuning and other safety threats such as data poisoning could be examined next.

Load-bearing premise

The papers selected for review and the threat model they support are representative of the full scope of harmful fine-tuning phenomena.

What would settle it

An empirical demonstration of a harmful fine-tuning attack that succeeds against all reviewed defenses yet falls outside the stated threat model assumptions.

Figures

Figures reproduced from arXiv: 2409.18169 by Fatih Ilhan, Ling Liu, Selim Furkan Tekin, Sihao Hu, Tiansheng Huang.

**Figure 2.** Illustration of two-stage pipeline for fine…

**Figure 3.** Harmful score/fine-tune accuracy of SFT/non-aligned model after finetuning on SST2

**Figure 4.** Alignment loss/embedding drift of SFT/non-aligned model finetuned on SST2 mixed with…

**Figure 5.** T-SNE visualization of hidden embedding drift under different harmful ratios

**Figure 6.** Model statistics (Left: harmful score, Middle: harmful training loss, Right: harmful…

**Figure 7.** Mind map for the existing literature on harmful fine-tuning attacks.

**Figure 8.** Illustration of four types of fine-tuning stage defense. (A)…

Original abstract

Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning with a few harmful data uploaded from the users can compromise the safety alignment of the model. The attack, known as harmful fine-tuning attack, has generated broad research interests in both academia and industry. In this paper, we first systematically formulate the threat model and basic assumptions of harmful fine-tuning. Then, we provide a comprehensive review of harmful fine-tuning from three fundamental perspectives: attack setting, defense design, and evaluation methodology. First, we present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Next, we systematically survey representative attacks, defense methods, and mechanical analysis of adverse effects in the existing literature. Finally, we introduce the evaluation methodology and outline future research directions, which can serve as guidelines and crucial perspectives for the future development of the subject. We also maintain a curated list of relevant papers, which are made accessible at https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A solid organizing survey on harmful fine-tuning that lays out the threat model and literature structure but adds no new attacks, defenses, or experiments.

This survey formulates a threat model for harmful fine-tuning attacks on LLMs and reviews the existing work across attack settings, defense designs, and evaluation methods. It also maintains a public GitHub list of papers. That organizational framing and the three-perspective breakdown are the main contributions; they give newcomers a map of the area without claiming fresh technical results. The abstract shows a clear plan to cover variants of the attack, representative methods, mechanical analyses from prior papers, and open directions, which is useful for a fast-moving topic. The curated list is a practical addition that others can build on. As a survey it does not ship new data, proofs, or falsifiable predictions, so its value rests on coverage and accuracy of the summaries. The main soft spot is the usual one for literature reviews: selection of papers and depth of the mechanical explanations depend on the authors' choices, and any gaps in representativeness would only show up on close reading of the full text. No internal contradictions or unsupported derivations appear in the provided structure. This paper is for researchers and engineers who need a starting point on LLM fine-tuning safety rather than for those seeking novel mechanisms. It is coherent on its own terms and engages the literature directly, so it deserves a serious referee even if the verdict after review is that it stays a survey.

Referee Report

0 major / 0 minor

Summary. The paper claims to systematically formulate the threat model and basic assumptions of harmful fine-tuning attacks on LLMs (where user-uploaded harmful data can compromise safety alignment), then deliver a comprehensive review structured around three perspectives—attack setting, defense design, and evaluation methodology—covering representative attacks, defenses, mechanical analyses of adverse effects, evaluation methods, and future directions, while maintaining an updatable GitHub repository of relevant papers.

Significance. If the threat-model formulation is accurate and the literature coverage representative, the work would serve as a useful archival and organizational reference for the emerging area of LLM safety against harmful fine-tuning. The explicit maintenance of a curated paper list is a concrete strength that supports ongoing community use and future guideline development.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript and the recommendation to accept. We appreciate the recognition of the threat-model formulation, literature coverage, and the value of the maintained GitHub repository as a community resource.

Circularity Check

0 steps flagged

No significant circularity in survey formulation or review

This is a literature survey paper whose central contribution is an organizational threat-model formulation plus a review of external attacks, defenses, and evaluation methods drawn from the cited literature. No equations, fitted parameters, predictions, or derivations appear in the provided abstract or structure. The threat model is presented as a synthesis of existing work rather than a self-referential construction. Any self-citations (if present) are not load-bearing for the core claims, which remain archival and organizational. This matches the default expectation for non-circular survey papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; it relies on the prior literature it cites for all technical content.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
cs.LG · 2026-04 · unverdicted · novelty 7.0

SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...

Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps
cs.CV · 2026-04 · unverdicted · novelty 7.0

GaussLock embeds traps targeting position, scale, rotation, opacity, and color in 3D Gaussian models to degrade unauthorized fine-tunes while preserving authorized performance.

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
cs.LG · 2025-08 · unverdicted · novelty 7.0

TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

Alignment Dynamics in LLM Fine-Tuning
cs.LG · 2026-05 · unverdicted · novelty 6.0

The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.

Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models
physics.optics · 2026-05 · unverdicted · novelty 6.0

JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
cs.CR · 2026-05 · unverdicted · novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
cs.AI · 2026-04 · unverdicted · novelty 6.0

Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
cs.CR · 2026-04 · unverdicted · novelty 6.0

ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

An Independent Safety Evaluation of Kimi K2.5
cs.CR · 2026-04 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

Secure LLM Fine-Tuning via Safety-Aware Probing
cs.LG · 2025-05 · unverdicted · novelty 6.0

SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.

Reference graph

Works this paper leans on

185 extracted references · 185 canonical work pages · cited by 10 Pith papers · 35 internal anchors

[1]

Identifying and tuning safety neurons in large language models

Anonymous. Identifying and tuning safety neurons in large language models. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=yR47RmND1m. under review

work page 2024
[2]

Measuring the contribution of fine-tuning to individual responses of LLMs

Anonymous. Measuring the contribution of fine-tuning to individual responses of LLMs. In Submitted to The Thirteenth International Conference on Learning Representations , 2024. URL https://openreview.net/forum?id=3VD92FuNCd. under review

work page 2024
[3]

On evaluating the durability of safeguards for open-weight LLMs

Anonymous. On evaluating the durability of safeguards for open-weight LLMs. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=fXJCqdUSVG. under review

work page 2024
[4]

Safety alignment shouldn’t be complicated

Anonymous. Safety alignment shouldn’t be complicated. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=9H91juqfgb. under review

work page 2024
[5]

SaloRA: Safety-alignment preserved low-rank adaptation

Anonymous. SaloRA: Safety-alignment preserved low-rank adaptation. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=GOoVzE9nSj. under review

work page 2024
[6]

Unraveling and mitigating safety alignment degradation of vision-language models

Anonymous. Unraveling and mitigating safety alignment degradation of vision-language models. In Submitted to The Thirteenth International Conference on Learning Representations,

work page
[7]

under review

URL https://openreview.net/forum?id=EEWpE9cR27. under review

work page
[8]

Your task may vary: A systematic understanding of alignment and safety degradation when fine-tuning LLMs

Anonymous. Your task may vary: A systematic understanding of alignment and safety degradation when fine-tuning LLMs. In Submitted to The Thirteenth International Confer- ence on Learning Representations , 2024. URL https://openreview.net/forum?id= vQ0zFYJaMo. under review

work page 2024
[9]

Ayyamperumal, S. G. and Ge, L. Current state of llm risks and ai guardrails. arXiv preprint arXiv:2406.12934, 2024

work page arXiv 2024
[10]

How to backdoor federated learning

Bagdasaryan, E., Veit, A., Hua, Y ., Estrin, D., and Shmatikov, V . How to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics, pp. 2938–2948. PMLR, 2020

work page 2020
[11]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y ., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions

Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., and Zou, J. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875, 2023

work page arXiv 2023
[13]

What does it mean for a language model to preserve privacy? In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pp

Brown, H., Lee, K., Mireshghallah, F., Shokri, R., and Tramèr, F. What does it mean for a language model to preserve privacy? In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pp. 2280–2292, 2022

work page 2022
[14]

N., Wu, Y ., Rocamora, E

Candogan, L. N., Wu, Y ., Rocamora, E. A., Chrysos, G., and Cevher, V . Single-pass detection of jailbreaking input in large language models. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models

work page 2024
[15]

Defending against alignment-breaking attacks via robustly aligned llm

Cao, B., Cao, Y ., Lin, L., and Chen, J. Defending against alignment-breaking attacks via robustly aligned llm. arXiv preprint arXiv:2309.14348, 2023

work page arXiv 2023
[16]

Extracting training data from large language models

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-V oss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021

work page 2021
[17]

Quantifying Memorization Across Neural Language Models

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022. 19

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

ArXiv:2403.05030 [cs]

Casper, S., Schulze, L., Patel, O., and Hadfield-Menell, D. Defending against unforeseen failure modes with latent adversarial training. arXiv preprint arXiv:2403.05030, 2024

work page arXiv 2024
[19]

Jailbreaking Black Box Large Language Models in Twenty Queries

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

The dark side of human feedback: Poisoning large language models via user inputs

Chen, B., Guo, H., Wang, G., Wang, Y ., and Yan, Q. The dark side of human feedback: Poisoning large language models via user inputs. arXiv preprint arXiv:2409.00787, 2024

work page arXiv 2024
[21]

Can editing llms inject harm? arXiv preprint arXiv:2407.20224, 2024

Chen, C., Huang, B., Li, Z., Chen, Z., Lai, S., Xu, X., Gu, J.-C., Gu, J., Yao, H., Xiao, C., et al. Can editing llms inject harm? arXiv preprint arXiv:2407.20224, 2024

work page arXiv 2024
[22]

Oml: Open, monetizable, and loyal ai

Cheng, Z., Contente, E., Finch, B., Golev, O., Hayase, J., Miller, A., Moshrefi, N., Nasery, A., Nailwal, S., Oh, S., et al. Oml: Open, monetizable, and loyal ai. Cryptology ePrint Archive, 2024

work page 2024
[23]

E., Stoica, I., and Xing, E

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y ., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y ., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/ 2023-03-30-vicuna/

work page 2023
[24]

K., Du, X., and Li, Y

Choi, H. K., Du, X., and Li, Y . Safety-aware fine-tuning of large language models. arXiv preprint arXiv:2410.10014, 2024

work page arXiv 2024
[25]

How to backdoor diffusion models? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Chou, S.-Y ., Chen, P.-Y ., and Ho, T.-Y . How to backdoor diffusion models? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4015–4024, 2023

work page 2023
[26]

G., Islam, M

Chowdhury, A. G., Islam, M. M., Kumar, V ., Shezan, F. H., Jain, V ., and Chadha, A. Breaking down the defenses: A comparative survey of attacks on large language models. arXiv preprint arXiv:2403.04786, 2024

work page arXiv 2024
[27]

Comprehensive assessment of jailbreak attacks against llms

Chu, J., Liu, Y ., Yang, Z., Shen, X., Backes, M., and Zhang, Y . Comprehensive assessment of jailbreak attacks against llms. arXiv preprint arXiv:2402.05668, 2024

work page arXiv 2024
[28]

Ai safety in generative ai large language models: A survey

Chua, J., Li, Y ., Yang, S., Wang, C., and Yao, L. Ai safety in generative ai large language models: A survey. arXiv preprint arXiv:2407.18369, 2024

work page arXiv 2024
[29]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv:1803.05457v1, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[30]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[31]

Recent advances in attack and defense approaches of large language models

Cui, J., Xu, Y ., Huang, Z., Zhou, S., Jiao, J., and Zhang, J. Recent advances in attack and defense approaches of large language models. arXiv preprint arXiv:2409.03274, 2024

work page arXiv 2024
[32]

Safe RLHF: Safe Reinforcement Learning from Human Feedback

Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y ., and Yang, Y . Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Hadi Amini, and Yanzhao Wu

Das, B. C., Amini, M. H., and Wu, Y . Security and privacy challenges of large language models: A survey. arXiv preprint arXiv:2402.00888, 2024

work page arXiv 2024
[34]

Das, N., Peng, S., and Chau, D. H. Skelevision: Towards adversarial resiliency of person tracking with multi-task learning. In European Conference on Computer Vision, pp. 449–466. Springer, 2022

work page 2022
[35]

Qlora: Efficient finetuning of quantized llms

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[36]

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Dong, H., Xiong, W., Goyal, D., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023. 20

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Safeguarding large language models: A survey

Dong, Y ., Mu, R., Zhang, Y ., Sun, S., Zhang, T., Wu, C., Jin, G., Qi, Y ., Hu, J., Meng, J., et al. Safeguarding large language models: A survey. arXiv preprint arXiv:2406.02622, 2024

work page arXiv 2024
[38]

Attacks, defenses and evaluations for llm conversation safety: A survey

Dong, Z., Zhou, Z., Yang, C., Shao, J., and Qiao, Y . Attacks, defenses and evaluations for llm conversation safety: A survey. arXiv preprint arXiv:2402.09283, 2024

work page arXiv 2024
[39]

Towards secure tuning: Mitigating security risks arising from benign instruction fine-tuning

Du, Y ., Zhao, S., Cao, J., Ma, M., Zhao, D., Fan, F., Liu, T., and Qin, B. Towards secure tuning: Mitigating security risks arising from benign instruction fine-tuning. arXiv preprint arXiv:2410.04524, 2024

work page arXiv 2024
[40]

H., Kumar, M

Eiras, F., Petrov, A., Torr, P. H., Kumar, M. P., and Bibi, A. Mimicking user data: On mitigating fine-tuning risks in closed large language models. arXiv preprint arXiv:2406.10288, 2024

work page arXiv 2024
[41]

KTO: Model Alignment as Prospect Theoretic Optimization

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[43]

G., Andersen, T., and Zhuang, J

Geren, C., Board, A., Dagher, G. G., Andersen, T., and Zhuang, J. Blockchain for large language model security and safety: A holistic survey. arXiv preprint arXiv:2407.20181, 2024

work page arXiv 2024
[44]

Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

Gliwa, B., Mochol, I., Biesek, M., and Wawer, A. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237, 2019

work page arXiv 1911
[45]

L., and Thomaz, A

Griffith, S., Subramanian, K., Scholz, J., Isbell, C. L., and Thomaz, A. L. Policy shaping: Integrating human feedback with reinforcement learning. Advances in neural information processing systems, 26, 2013

work page 2013
[46]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Gu, T., Dolan-Gavitt, B., and Garg, S. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[47]

The vllm safety paradox: Dual ease in jailbreak attack and defense

Guo, Y ., Jiao, F., Nie, L., and Kankanhalli, M. The vllm safety paradox: Dual ease in jailbreak attack and defense. arXiv preprint arXiv:2411.08410, 2024

work page arXiv 2024
[48]

Regulating chatgpt and other large generative ai models

Hacker, P., Engel, A., and Mauer, M. Regulating chatgpt and other large generative ai models. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 1112–1123, 2023

work page 2023
[49]

T., Haghtalab, N., and Steinhardt, J

Halawi, D., Wei, A., Wallace, E., Wang, T. T., Haghtalab, N., and Steinhardt, J. Covert mali- cious finetuning: Challenges in safeguarding llm adaptation. arXiv preprint arXiv:2406.20053, 2024

work page arXiv 2024
[50]

The effect of fine-tuning on language model toxicity

Hawkins, W., Mittelstadt, B., and Russell, C. The effect of fine-tuning on language model toxicity. arXiv preprint arXiv:2410.15821, 2024

work page arXiv 2024
[51]

He, F., Zhu, T., Ye, D., Liu, B., Zhou, W., and Yu, P. S. The emerged security and privacy of llm agent: A survey with case studies. arXiv preprint arXiv:2407.19354, 2024

work page arXiv 2024
[52]

What’s in your" safe" data?: Identifying benign data that breaks safety

He, L., Xia, M., and Henderson, P. What’s in your" safe" data?: Identifying benign data that breaks safety. arXiv preprint arXiv:2404.01099, 2024

work page arXiv 2024
[53]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 , 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[54]

Safe lora: the silver lining of reducing safety risks when fine-tuning large language models

Hsu, C.-Y ., Tsai, Y .-L., Lin, C.-H., Chen, P.-Y ., Yu, C.-M., and Huang, C.-Y . Safe lora: the silver lining of reducing safety risks when fine-tuning large language models. arXiv preprint arXiv:2405.16833, 2024

work page arXiv 2024
[55]

Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification

Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[56]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 21

work page internal anchor Pith review Pith/arXiv arXiv 2021
[57]

F., and Liu, L

Hu, S., Huang, T., ˙Ilhan, F., Tekin, S. F., and Liu, L. Large language model-powered smart contract vulnerability detection: New perspectives. In 2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), pp. 297–306. IEEE, 2023

work page 2023
[58]

Bert4eth: A pre-trained transformer for ethereum fraud detection

Hu, S., Zhang, Z., Luo, B., Lu, S., He, B., and Liu, L. Bert4eth: A pre-trained transformer for ethereum fraud detection. In Proceedings of the ACM Web Conference 2023, pp. 2189–2197, 2023

work page 2023
[59]

Zipzap: Efficient training of language models for large-scale fraud detection on blockchain

Hu, S., Huang, T., Chow, K.-H., Wei, W., Wu, Y ., and Liu, L. Zipzap: Efficient training of language models for large-scale fraud detection on blockchain. In Proceedings of the ACM on Web Conference 2024, pp. 2807–2816, 2024

work page 2024
[60]

A survey on large language model-based game agents

Hu, S., Huang, T., Ilhan, F., Tekin, S., Liu, G., Kompella, R., and Liu, L. A survey on large language model-based game agents. arXiv preprint arXiv:2404.02039, 2024

work page arXiv 2024
[61]

Composite backdoor attacks against large language models

Huang, H., Zhao, Z., Backes, M., Shen, Y ., and Zhang, Y . Composite backdoor attacks against large language models. arXiv preprint arXiv:2310.07676, 2023

work page arXiv 2023
[62]

Achieving personalized federated learning with sparse local models

Huang, T., Liu, S., Shen, L., He, F., Lin, W., and Tao, D. Achieving personalized federated learning with sparse local models. arXiv preprint arXiv:2201.11380, 2022

work page arXiv 2022
[63]

Fusion of global and local knowledge for personalized federated learning

Huang, T., Shen, L., Sun, Y ., Lin, W., and Tao, D. Fusion of global and local knowledge for personalized federated learning. arXiv preprint arXiv:2302.11051, 2023

work page arXiv 2023
[64]

Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning

Huang, T., Bhattacharya, G., Joshi, P., Kimball, J., and Liu, L. Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning. arXiv preprint arXiv:2408.09600, 2024

work page arXiv 2024
[65]

Lockdown: backdoor defense for federated learning with isolated subspace training

Huang, T., Hu, S., Chow, K.-H., Ilhan, F., Tekin, S., and Liu, L. Lockdown: backdoor defense for federated learning with isolated subspace training. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[66]

F., and Liu, L

Huang, T., Hu, S., Ilhan, F., Tekin, S. F., and Liu, L. Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation. arXiv preprint arXiv:2409.01586, 2024

work page arXiv 2024
[67]

F., and Liu, L

Huang, T., Hu, S., Ilhan, F., Tekin, S. F., and Liu, L. Lazy safety alignment for large language models against harmful fine-tuning. arXiv preprint arXiv:2405.18641, 2024

[68]

Vaccine: Perturbation-aware alignment for large language model

Huang, T., Hu, S., and Liu, L. Vaccine: Perturbation-aware alignment for large language model. arXiv preprint arXiv:2402.01109, 2024

[69]

A survey of safety and trustworthiness of large language models through the lens of verification and validation

Huang, X., Ruan, W., Huang, W., Jin, G., Dong, Y., Wu, C., Bensalem, S., Mu, R., Qi, Y., Zhao, X., et al. A survey of safety and trustworthiness of large language models through the lens of verification and validation. Artificial Intelligence Review, 57(7):175, 2024

[70]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., et al. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566, 2024

[71]

Adaptive deep neural network inference optimization with eenet

Ilhan, F., Chow, K.-H., Hu, S., Huang, T., Tekin, S., Wei, W., Wu, Y., Lee, M., Kompella, R., Latapie, H., et al. Adaptive deep neural network inference optimization with eenet. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1373–1382, 2024

[72]

Resource-efficient transformer pruning for finetuning of large models

Ilhan, F., Su, G., Tekin, S. F., Huang, T., Hu, S., and Liu, L. Resource-efficient transformer pruning for finetuning of large models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16206–16215, 2024

[73]

Measuring forgetting of memorized training examples

Jagielski, M., Thakkar, O., Tramer, F., Ippolito, D., Lee, K., Carlini, N., Wallace, E., Song, S., Thakurta, A., Papernot, N., et al. Measuring forgetting of memorized training examples. arXiv preprint arXiv:2207.00099, 2022

[74]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., Kirchenbauer, J., Chiang, P.-y., Goldblum, M., Saha, A., Geiping, J., and Goldstein, T. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

[75]

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Jain, S., Kirk, R., Lubana, E. S., Dick, R. P., Tanaka, H., Grefenstette, E., Rocktäschel, T., and Krueger, D. S. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv preprint arXiv:2311.12786, 2023

[76]

What makes and breaks safety fine-tuning? mechanistic study

Jain, S., Lubana, E. S., Oksuz, K., Joy, T., Torr, P. H., Sanyal, A., and Dokania, P. K. What makes and breaks safety fine-tuning? mechanistic study. arXiv preprint arXiv:2407.10264, 2024

[77]

Active data pattern extraction attacks on generative language models

Jayaraman, B., Ghosh, E., Inan, H., Chase, M., Roy, S., and Dai, W. Active data pattern extraction attacks on generative language models. arXiv preprint arXiv:2207.10802, 2022

[78]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset

Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., Sun, R., Wang, Y., and Yang, Y. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023

[79]

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models

Jin, H., Hu, L., Li, X., Zhang, P., Chen, C., Zhuang, J., and Wang, H. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models. arXiv preprint arXiv:2407.01599, 2024

[80]

Deduplicating training data mitigates privacy risks in language models

Kandpal, N., Wallace, E., and Raffel, C. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pp. 10697–10707. PMLR, 2022


