GradShield: Alignment Preserving Finetuning

Basel Alomair; David Wagner; Emad A. Alghamdi; Patrick Mendoza; Raluca Ada Popa; Xiao Huang; Zhanhao Hu

arxiv: 2605.14194 · v2 · pith:6XB6HN2Qnew · submitted 2026-05-13 · 💻 cs.CL

Pith reviewed 2026-06-30 21:08 UTC · model grok-4.3

keywords: LLM finetuning · alignment preservation · data filtering · safety misalignment · harmfulness score · attack success rate · utility preservation

The pith

GradShield removes data points that would cause misalignment during LLM finetuning by scoring their implicit harmfulness and applying adaptive thresholds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GradShield to prevent safety misalignment in LLMs after finetuning on utility tasks. It computes a Finetuning Implicit Harmfulness Score for each training example and removes those above an adaptive threshold. This filtering keeps attack success rates below 6 percent across tested harm levels while utility metrics stay close to the unfiltered baseline. A sympathetic reader would care because standard finetuning often introduces hidden risks even from data that looks benign. The method is evaluated on multiple tasks with varying amounts of harmful content mixed in.

Core claim

GradShield identifies and removes potentially harmful data points before finetuning by computing a Finetuning Implicit Harmfulness Score (FIHS) for each example and applying an adaptive thresholding algorithm; when applied to utility fine-tuning tasks that contain varying levels of harmful data, the resulting models maintain an attack success rate below 6 percent while preserving utility performance, outperforming prior filtering baselines.

What carries the argument

The Finetuning Implicit Harmfulness Score (FIHS) paired with an adaptive thresholding algorithm, which scores each data point for its likelihood of causing misalignment and filters out the highest-scoring examples.
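The thresholding step can be sketched from the pseudocode fragment preserved under the reference graph (steps 7 to 15): compare a single Gaussian against a two-component mixture, then pick the cut-off accordingly. The following is a minimal sketch under stated assumptions, not the authors' implementation. The earlier steps of the algorithm are truncated in the source, so the EM fitting routine, its initialization, the `MIN_SIGMA` floor, the use of the population standard deviation, and every function name here are hypothetical choices. Scores are taken as given; the sketch does not compute FIHS.

```python
import math
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple

MIN_SIGMA = 1e-6  # assumed floor so a collapsed component never divides by zero


def log_normal(x: float, mu: float, sigma: float) -> float:
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def e_step(xs: Sequence[float], pi: float, mu1: float, s1: float,
           mu2: float, s2: float) -> Tuple[List[float], float]:
    """Return component-1 responsibilities and the mixture log-likelihood."""
    resp, loglik = [], 0.0
    for x in xs:
        a = math.log(pi) + log_normal(x, mu1, s1)
        b = math.log(1.0 - pi) + log_normal(x, mu2, s2)
        m = max(a, b)
        lse = m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp
        resp.append(math.exp(a - lse))
        loglik += lse
    return resp, loglik


def fit_two_gaussians(xs: Sequence[float], iters: int = 200) -> Tuple[List[float], float]:
    """Plain EM for a 1-D two-component Gaussian mixture (assumed fitting method)."""
    ordered = sorted(xs)
    half = len(ordered) // 2
    pi, mu1, mu2 = 0.5, fmean(ordered[:half]), fmean(ordered[half:])
    s1 = s2 = max(pstdev(xs), MIN_SIGMA)
    for _ in range(iters):
        resp, _loglik = e_step(xs, pi, mu1, s1, mu2, s2)
        n1 = max(sum(resp), 1e-12)
        n2 = max(len(xs) - n1, 1e-12)
        mu1 = sum(r * x for r, x in zip(resp, xs)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n2
        v1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / n1
        v2 = sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / n2
        s1, s2 = max(math.sqrt(v1), MIN_SIGMA), max(math.sqrt(v2), MIN_SIGMA)
        pi = min(max(n1 / len(xs), 1e-6), 1.0 - 1e-6)
    return e_step(xs, pi, mu1, s1, mu2, s2)


def adaptive_threshold(scores: Sequence[float], alpha: float) -> float:
    """Pick a cut-off following steps 7-15 of the preserved pseudocode."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mu, sigma = fmean(scores), max(pstdev(scores), MIN_SIGMA)
    logl1 = sum(log_normal(x, mu, sigma) for x in scores)
    resp, logl2 = fit_two_gaussians(scores)
    if logl2 - logl1 > alpha:  # mixture explains the scores clearly better
        comp0 = [x for x, r in zip(scores, resp) if r >= 0.5]
        comp1 = [x for x, r in zip(scores, resp) if r < 0.5]
        if comp0 and comp1:
            return min(max(comp0), max(comp1))
    return mu + 2.0 * sigma  # single-Gaussian fallback


def filter_examples(examples: Sequence, scores: Sequence[float], alpha: float) -> list:
    """Keep examples whose score does not exceed the adaptive threshold."""
    t = adaptive_threshold(scores, alpha)
    return [ex for ex, s in zip(examples, scores) if s <= t]
```

With clearly bimodal scores the cut-off lands on the largest score of the lower cluster, so the upper cluster is removed; with unimodal scores the rule falls back to the mean plus two standard deviations. The value of `alpha` is not stated in the text available here and would need to be taken from the paper.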

If this is right

Finetuned LLMs stay aligned even when the training mix contains both explicit and implicit harmful examples.
Utility on downstream tasks remains comparable to models trained without any filtering.
The same filtering procedure works across multiple tasks and different overall harm levels in the data.
GradShield exceeds the safety-utility trade-off achieved by existing baseline filtering methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be combined with other alignment techniques such as reinforcement learning from human feedback to further reduce residual risks.
Testing on models larger than those evaluated here would show whether the FIHS computation scales without additional overhead.
If the adaptive threshold is made task-specific rather than global, the method might retain more useful data in low-harm settings.

Load-bearing premise

The assumption that the Finetuning Implicit Harmfulness Score reliably identifies data that will cause misalignment across different tasks and harm levels.

What would settle it

Running GradShield on a new utility task where the filtered model still shows attack success rate above 10 percent on standard safety benchmarks while matching the utility of the unfiltered model.

Figures

Figures reproduced from arXiv: 2605.14194 by Basel Alomair, David Wagner, Emad A. Alghamdi, Patrick Mendoza, Raluca Ada Popa, Xiao Huang, Zhanhao Hu.

**Figure 2.** Distribution of FIHS scores of utility and harmfulness datasets. (a) FIHS scores. (b)

Original abstract

Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.
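For readers unfamiliar with the metric, Attack Success Rate is commonly computed as the fraction of adversarial prompts for which the model's response is judged harmful. The sketch below assumes the per-prompt judgements are already available as booleans; how the paper obtains those judgements is not described in the abstract, and the function names and the default bound are illustrative only.

```python
from typing import Sequence


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """Fraction of adversarial prompts whose response was judged harmful."""
    if not attack_succeeded:
        raise ValueError("need at least one judged prompt")
    return sum(1 for ok in attack_succeeded if ok) / len(attack_succeeded)


def meets_asr_bound(attack_succeeded: Sequence[bool], bound: float = 0.06) -> bool:
    """True when the measured ASR is strictly below the bound (6% by default)."""
    return attack_success_rate(attack_succeeded) < bound
```

A model that yields one harmful response in a hundred prompts has an ASR of 1% and passes the 6% bound; one harmful response in four prompts gives 25% and fails it.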

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GradShield uses gradient-derived FIHS scores and adaptive thresholding to filter harmful finetuning data, but the abstract gives no concrete evidence that this beats standard baselines or generalizes beyond the tested mixtures.

GradShield filters data before finetuning by scoring each point with a gradient-based Finetuning Implicit Harmfulness Score and then applying an adaptive threshold. The headline claim is that this keeps attack success rate below 6 percent across several utility tasks while utility stays intact.

The method is new in its specific combination of gradient scoring for implicit harm and the adaptive threshold step. It directly targets the practical issue that even non-obvious data can degrade alignment, and the authors test it on multiple tasks with varying harm levels.

The main weakness is the missing experimental detail. The abstract does not describe how FIHS is computed from the gradients, what the baseline methods actually are, which datasets were used, or any statistical tests. Without those, it is impossible to check whether the score truly isolates misalignment risk or simply tracks gradient size or data statistics. The stress-test point about task-dependent harm is relevant here: if the threshold is tuned to the training mixture, performance on new harm distributions could drop.
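One inexpensive way to probe the concern that the score might simply track gradient size is a rank correlation between per-example FIHS values and per-example gradient norms. The sketch below computes Spearman's coefficient with the standard library only. It assumes both quantities have already been computed for the same examples; nothing here comes from the paper, and a high correlation would only suggest, not prove, that the score adds little beyond magnitude.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(fihs_scores, grad_norms)` on real data would return a value in [-1, 1]; both argument names are placeholders for arrays the reader would have to produce.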

This is a paper for people who finetune LLMs and need a concrete filtering tool. A reader already working on alignment preservation might extract the method and try it, but the current write-up does not yet support strong conclusions. It deserves peer review because the underlying problem matters and the approach is straightforward enough to evaluate with the right experiments.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces GradShield, a data-filtering approach for LLM finetuning that computes a Finetuning Implicit Harmfulness Score (FIHS) per data point from gradient information and applies an adaptive thresholding algorithm to remove points likely to cause misalignment. The method is evaluated on multiple utility fine-tuning tasks with varying levels of harmful data; the central claim is that it outperforms baselines by keeping Attack Success Rate (ASR) below 6% while preserving utility metrics.

Significance. If the FIHS reliably ranks data by its causal effect on future misalignment and the adaptive threshold generalizes across tasks and harm distributions, the technique would address a practical safety concern in post-training. The gradient-based scoring is a reasonable starting point, but the absence of any experimental details, baseline definitions, dataset descriptions, or statistical tests in the abstract prevents assessment of whether the headline result holds.

major comments (2)

[Abstract] Abstract: the claim that GradShield 'outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below 6%' supplies no experimental details, baseline descriptions, dataset information, or statistical evidence, so the central empirical result cannot be checked against the claim.
[Abstract] The headline result requires that FIHS plus adaptive thresholding separates misalignment-causing data across tasks and harm levels. No evidence is supplied that FIHS captures causal harm rather than gradient magnitude or data statistics, nor that the threshold generalizes beyond the training mixture; if either fails, the claimed outperformance on unseen distributions would not hold.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify points about the abstract. We respond to each major comment below.

Referee: [Abstract] Abstract: the claim that GradShield 'outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below 6%' supplies no experimental details, baseline descriptions, dataset information, or statistical evidence, so the central empirical result cannot be checked against the claim.

Authors: Abstracts are space-constrained summaries. The full manuscript provides the requested information: baseline methods are defined in Section 3 (including random filtering and gradient-norm baselines), datasets are described in Section 4 (utility tasks with controlled injection of harmful data at multiple ratios), and statistical tests appear in Section 5 with reported means and standard deviations over multiple runs. The headline claim is drawn directly from those experiments.

Revision: partial

Referee: [Abstract] The headline result requires that FIHS plus adaptive thresholding separates misalignment-causing data across tasks and harm levels. No evidence is supplied that FIHS captures causal harm rather than gradient magnitude or data statistics, nor that the threshold generalizes beyond the training mixture; if either fails, the claimed outperformance on unseen distributions would not hold.

Authors: The paper evaluates the combination of FIHS and adaptive thresholding on several distinct utility tasks that vary in both domain and the proportion of harmful data. Results show consistent ASR below 6% with utility retention, outperforming the listed baselines. FIHS is constructed from per-example gradient signals during a simulated finetuning step, which we argue encodes more task-specific alignment impact than raw magnitude or surface statistics alone. The adaptive threshold is recomputed from the score distribution of each new mixture, providing a form of per-dataset generalization. We do not present interventional causal evidence (e.g., counterfactual data edits), as the work focuses on practical filtering efficacy; the empirical consistency across the tested distributions is the evidence offered for the headline claim.

Revision: no
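To make "per-example gradient signals during a simulated finetuning step" concrete, here is one generic way such a signal can be built. This is not the paper's FIHS definition, which the text available here does not give. It is a standard first-order (Taylor) estimate of how a single SGD step on one training example would change a loss measured on a reference safety set. The function names, the existence of a safety reference gradient, and the learning rate are all assumptions.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("gradient vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def first_order_safety_loss_change(example_grad: Sequence[float],
                                   safety_grad: Sequence[float],
                                   lr: float) -> float:
    """Estimated change in the safety loss after one SGD step on the example.

    An SGD step moves the parameters by -lr * example_grad, so to first order
    the safety loss changes by -lr * <example_grad, safety_grad>. A positive
    value means the step is estimated to raise the safety loss.
    """
    return -lr * dot(example_grad, safety_grad)
```

Under this reading, an example whose gradient points against the safety gradient gets a positive value, while an example aligned with it gets a negative one. Unlike a raw gradient norm, the sign and size depend on direction relative to the safety objective, which is the kind of distinction the rebuttal appeals to.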

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description introduce GradShield via FIHS computation and adaptive thresholding but contain no equations, derivations, or load-bearing steps. No self-definitional reductions, fitted inputs renamed as predictions, or self-citation chains are present. The central claim (ASR <6% with utility preservation) is presented as an empirical outcome without any visible reduction to the method's own fitted quantities by construction. This is the expected honest non-finding for a methods paper whose derivation chain is not shown to collapse into its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no experimental details, and no description of modeling choices, so no free parameters, axioms, or invented entities can be identified.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning
cs.LG · 2026-06 · unverdicted · novelty 5.0

DualSelect couples task and reference selection via a minimax framework with entropy-regularized scoring to preserve safety in LLM fine-tuning, reporting at least 5.10 point gains in Safety Avg. over baselines on 1B-8...

Reference graph

Works this paper leans on

2 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Character-level Convolutional Networks for Text Classification

Association for Computational Linguistics. doi: 10.18653/v1/D19-5409. URL https://www.aclweb.org/anthology/D19-5409.
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016. URL https://arxiv.org/abs/1509.01626.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matth...

[2]

I cannot fulfill this request

+ (1−π) N(FIHS(x_f) | μ_2, σ_2^2)]
7: if LogL_2 − LogL_1 > α then
8:   Choose Gaussian mixture model
9:   labels ← AssignComponents({FIHS(x_f)}, π, μ_1, σ_1, μ_2, σ_2)
10:   t ← min(max({FIHS(x_f) | labels(x_f) = 0}), max({FIHS(x_f) | labels(x_f) = 1}))
11: else
12:   Choose single Gaussian model
13:   t ← μ + 2σ
14: end if
15: return Threshold t
A.2 Proxy Safety Score Justification ...

2023
