GradShield: Alignment Preserving Finetuning

2026 · cs.CL · arXiv 2605.14194

1 Pith paper cites this work.

abstract

Finetuning poses a significant risk of safety misalignment for Large Language Models (LLMs), as models can be compromised by both explicitly and implicitly harmful data. Even seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. GradShield computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point and applies an adaptive thresholding algorithm to decide which points to remove. We apply GradShield to multiple utility finetuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.
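
The score-then-threshold filtering described in the abstract can be illustrated with a minimal sketch. The abstract does not define how FIHS is computed or how the adaptive threshold is chosen, so everything below is an assumption made for illustration: the scores are taken as given numbers, the threshold rule (median plus k times the median absolute deviation) is a generic outlier heuristic standing in for the paper's algorithm, and the names adaptive_threshold, filter_dataset and attack_success_rate are hypothetical.

```python
from statistics import median


def adaptive_threshold(scores, k=3.0):
    """Hypothetical stand-in rule: median + k * MAD of the scores.

    This is NOT the paper's thresholding algorithm, only a common
    outlier heuristic used here to show where a threshold would plug in.
    """
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    return med + k * mad


def filter_dataset(examples, scores, k=3.0):
    """Split examples into (kept, removed) using a per-example harmfulness score.

    `scores` is assumed to hold one precomputed score per example
    (higher means more likely to harm alignment).
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    tau = adaptive_threshold(scores, k)
    kept = [ex for ex, s in zip(examples, scores) if s <= tau]
    removed = [ex for ex, s in zip(examples, scores) if s > tau]
    return kept, removed, tau


def attack_success_rate(judgements):
    """Fraction of harmful prompts whose response was judged harmful."""
    if not judgements:
        raise ValueError("judgements must be non-empty")
    return sum(bool(j) for j in judgements) / len(judgements)


# Toy usage with made-up scores; example "e" is an obvious outlier.
examples = ["a", "b", "c", "d", "e"]
scores = [0.10, 0.20, 0.15, 0.12, 5.0]
kept, removed, tau = filter_dataset(examples, scores)
```

In this toy run only the outlier is filtered out and the remaining four examples would go on to finetuning. A real pipeline would replace the hand-written scores with the paper's FIHS values and the heuristic with its adaptive thresholding procedure.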

representative citing papers

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

cs.LG · 2026-06-01 · unverdicted · novelty 5.0

DualSelect couples task and reference selection via a minimax framework with entropy-regularized scoring to preserve safety in LLM fine-tuning, reporting gains of at least 5.10 points in Safety Avg. over baselines on 1B-8B models.
