GradShield removes data points likely to cause safety misalignment during LLM finetuning by computing a Finetuning Implicit Harmfulness Score and applying adaptive thresholding, keeping attack success rates below 6% while preserving utility.
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
GradShield: Alignment Preserving Finetuning