Recognition: no theorem link
GradShield: Alignment Preserving Finetuning
Pith reviewed 2026-05-15 04:40 UTC · model grok-4.3
The pith
GradShield filters data points by their implicit harmfulness score to keep finetuned LLMs aligned while retaining utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GradShield computes a Finetuning Implicit Harmfulness Score (FIHS) for every data point and removes those above an adaptive threshold, producing models whose attack success rate stays below 6 percent on safety benchmarks while utility metrics remain comparable to those of models trained on unfiltered data.
What carries the argument
The Finetuning Implicit Harmfulness Score (FIHS), which quantifies the expected contribution of a single training example to post-finetuning misalignment.
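The paper does not publish the FIHS formula (a point the referee report below presses on), so what follows is only a minimal sketch under stated assumptions: FIHS is taken to be the alignment between an example's loss gradient and a reference "harmful" gradient direction estimated from known-harmful data, and the threshold adapts to the observed score distribution through a robust statistic. Every name here (compute_fihs, adaptive_threshold, harmful_ref_grad, loss_fn) is a hypothetical placeholder, not the authors' implementation.

    # Minimal sketch: gradient-alignment scoring with an adaptive threshold (assumed form, not the paper's).
    import torch
    import torch.nn.functional as F

    def flat_grad(model, loss):
        """Loss gradient over all trainable parameters, flattened into one vector."""
        params = [p for p in model.parameters() if p.requires_grad]
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.reshape(-1) for g in grads])

    def compute_fihs(model, loss_fn, example, harmful_ref_grad):
        """Hypothetical FIHS: cosine alignment of this example's gradient with a harmful reference direction."""
        loss = loss_fn(model, example)          # loss_fn is a placeholder for the training objective
        g = flat_grad(model, loss)
        return F.cosine_similarity(g, harmful_ref_grad, dim=0).item()

    def adaptive_threshold(scores, k=1.5):
        """One plausible adaptive rule: flag scores far above the bulk of the distribution.
        k is an explicit sensitivity knob; the referee questions whether this step is truly parameter-free."""
        s = torch.tensor(scores)
        median = s.median()
        mad = (s - median).abs().median()       # median absolute deviation
        return (median + k * mad).item()

    def filter_dataset(model, loss_fn, dataset, harmful_ref_grad):
        scores = [compute_fihs(model, loss_fn, ex, harmful_ref_grad) for ex in dataset]
        tau = adaptive_threshold(scores)
        kept = [ex for ex, s in zip(dataset, scores) if s <= tau]   # drop high-FIHS points
        return kept, scores, tau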
If this is right
- Finetuning runs can include more diverse data sources without separate safety post-processing steps.
- The same scoring approach can be applied to different base models and task distributions.
- Utility degradation from aggressive filtering is avoided because the threshold adapts to the observed score distribution.
- Models remain resistant to both explicitly and implicitly harmful examples in the training mix.
Where Pith is reading between the lines
- The method could be inserted as a preprocessing stage before any existing alignment or safety-tuning pipeline.
- Extending the score to multimodal or instruction-following data might protect against new forms of misalignment.
- If the score correlates with gradient directions that increase harmful output probability, it might generalize to other optimization objectives beyond standard language modeling.
Load-bearing premise
That the FIHS value for a data point reliably predicts whether including it will increase the model's vulnerability to attacks after finetuning.
What would settle it
A controlled experiment that injects only high-FIHS examples into an otherwise safe dataset and checks whether the resulting model's attack success rate rises relative to training on the safe data alone, while injecting the same number of low-FIHS examples leaves it essentially unchanged.
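A minimal sketch of that settling experiment, assuming precomputed FIHS scores for a candidate pool and placeholder finetune and evaluate_asr routines; it illustrates the protocol only, not the paper's code.

    # Hedged sketch: inject the highest-FIHS candidates into a safe set and compare ASR
    # against the safe baseline and against a matched low-FIHS injection.
    # finetune / evaluate_asr are placeholder training and safety-evaluation routines.
    def settling_experiment(base_model, safe_data, candidate_data, scores, n_inject=500):
        ranked = sorted(zip(scores, candidate_data), key=lambda t: t[0], reverse=True)
        high_fihs = [ex for _, ex in ranked[:n_inject]]
        low_fihs = [ex for _, ex in ranked[-n_inject:]]

        asr_safe = evaluate_asr(finetune(base_model, safe_data))
        asr_high = evaluate_asr(finetune(base_model, safe_data + high_fihs))
        asr_low = evaluate_asr(finetune(base_model, safe_data + low_fihs))

        # If FIHS is predictive, asr_high should rise well above asr_safe while asr_low stays close to it.
        return {"safe": asr_safe, "+high_fihs": asr_high, "+low_fihs": asr_low}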
Original abstract
Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GradShield, a filtering method for LLM finetuning that computes a Finetuning Implicit Harmfulness Score (FIHS) per data point via gradient signals from a proxy step and applies adaptive thresholding to remove points likely to cause misalignment. It evaluates the approach on multiple utility finetuning tasks with varying levels of harmful data, claiming that the resulting models achieve ASR below 6% while preserving utility and outperforming baselines.
Significance. If the central claim holds, GradShield would offer a practical, gradient-based safeguard for maintaining alignment during finetuning, addressing a real vulnerability in post-training safety. The empirical results across harmful-data regimes and the focus on both safety and utility metrics would make it a useful contribution to the alignment literature, particularly if the method proves more precise than volume-based filtering.
major comments (3)
- [Section 3] Section 3 (FIHS definition and computation): The manuscript provides no equations, pseudocode, or implementation details for how the gradient signals are aggregated into the FIHS score or for the proxy finetuning step used to compute it. Without these, it is impossible to verify whether the reported ASR reductions are produced by the claimed mechanism or by an unstated hyperparameter choice.
- [Section 4] Section 4 (Experimental evaluation): No ablation is reported that compares GradShield against random or size-matched filtering at equivalent data-retention rates. The central claim that FIHS specifically identifies harmful points therefore rests only on post-filtering aggregate ASR; the observed <6% ASR could be explained by aggressive data reduction rather than accurate harm detection.
- [Section 4.2] Section 4.2 (Correlation analysis): The paper reports only aggregate safety and utility metrics across harmful-data levels and does not provide per-point correlation between FIHS scores and actual misalignment risk (e.g., via controlled retention experiments). This leaves the weakest assumption—that FIHS ranks data by misalignment potential—without direct empirical support.
minor comments (2)
- [Abstract] The abstract and introduction use the term 'parameter-free' for the thresholding step, but the adaptive algorithm description implies at least one tunable sensitivity parameter; clarify the exact claim.
- [Figure 2] Figure 2 (ASR vs. harmful-data ratio) lacks error bars or statistical significance tests across the multiple runs; add these to support the 'consistently below 6%' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity and empirical rigor. We will revise the manuscript to address each point as detailed below.
Point-by-point responses
-
Referee: [Section 3] Section 3 (FIHS definition and computation): The manuscript provides no equations, pseudocode, or implementation details for how the gradient signals are aggregated into the FIHS score or for the proxy finetuning step used to compute it. Without these, it is impossible to verify whether the reported ASR reductions are produced by the claimed mechanism or by an unstated hyperparameter choice.
Authors: We agree that the current version lacks the necessary implementation details for full reproducibility and verification. In the revised manuscript, we will add the full mathematical definition of the Finetuning Implicit Harmfulness Score (FIHS), including the gradient computation from the proxy finetuning step, the aggregation formula across tokens or layers, the adaptive thresholding procedure, and accompanying pseudocode. This will explicitly show how the score is derived and allow independent verification of the mechanism. revision: yes
-
Referee: [Section 4] Section 4 (Experimental evaluation): No ablation is reported that compares GradShield against random or size-matched filtering at equivalent data-retention rates. The central claim that FIHS specifically identifies harmful points therefore rests only on post-filtering aggregate ASR; the observed <6% ASR could be explained by aggressive data reduction rather than accurate harm detection.
Authors: We acknowledge that the absence of a direct comparison to random or size-matched filtering at matched retention rates leaves open the possibility that safety gains stem from data volume reduction alone. We will add this ablation in the revised Section 4, evaluating GradShield against random sampling and size-matched baselines at identical retention rates across all harmful-data regimes. The results will be reported with the same ASR and utility metrics to isolate the contribution of the FIHS-based selection; a minimal sketch of this ablation design appears after these responses. revision: yes
-
Referee: [Section 4.2] Section 4.2 (Correlation analysis): The paper reports only aggregate safety and utility metrics across harmful-data levels and does not provide per-point correlation between FIHS scores and actual misalignment risk (e.g., via controlled retention experiments). This leaves the weakest assumption—that FIHS ranks data by misalignment potential—without direct empirical support.
Authors: We agree that aggregate metrics alone do not directly validate the per-point ranking assumption. In the revision, we will include controlled retention experiments that sort data by FIHS score and measure ASR as a function of the retention threshold (e.g., retaining the lowest-FIHS fraction). We will also report Spearman or Pearson correlation between FIHS values and per-point misalignment indicators where feasible, providing direct evidence that higher FIHS scores correspond to greater misalignment risk; a sketch of this retention sweep and correlation analysis also follows these responses. revision: yes
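For the second response, a minimal sketch of the promised retention-matched ablation: FIHS-filtered, randomly filtered at the same retention count, and unfiltered data are trained with an identical recipe and scored with the same metrics. finetune and evaluate are hypothetical placeholders.

    import random

    def retention_matched_ablation(base_model, dataset, fihs_kept, finetune, evaluate):
        """Isolate FIHS-based selection from sheer data reduction by matching retention counts."""
        k = len(fihs_kept)                              # matched retention count
        random_kept = random.sample(list(dataset), k)

        results = {}
        for name, subset in [("fihs", fihs_kept),
                             ("random", random_kept),
                             ("unfiltered", list(dataset))]:
            model = finetune(base_model, subset)        # identical training recipe for every arm
            results[name] = evaluate(model)             # e.g. {"asr": ..., "utility": ...}
        return results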
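For the third response, a sketch of the retention sweep and rank correlation, assuming a per-point misalignment indicator is available (for example, the ASR change attributable to retaining that point); finetune and evaluate_asr are again placeholders.

    import numpy as np
    from scipy.stats import spearmanr

    def retention_sweep(base_model, dataset, scores, finetune, evaluate_asr,
                        fractions=(0.5, 0.7, 0.9, 1.0)):
        """Retain the lowest-FIHS fraction and track ASR as the threshold loosens."""
        order = np.argsort(scores)                      # lowest FIHS first
        curve = []
        for frac in fractions:
            keep = [dataset[i] for i in order[: int(frac * len(dataset))]]
            curve.append((frac, evaluate_asr(finetune(base_model, keep))))
        return curve                                    # ASR should climb as high-FIHS points enter

    def fihs_risk_correlation(scores, misalignment_indicators):
        """Spearman rank correlation between FIHS and a per-point misalignment indicator."""
        rho, pval = spearmanr(scores, misalignment_indicators)
        return rho, pval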
Circularity Check
No significant circularity in GradShield's FIHS-based filtering
Full rationale
The paper defines FIHS via gradient-based computation on a proxy finetuning step and applies adaptive thresholding to remove points before the main finetuning. No equation or step reduces the reported ASR (<6%) or utility metrics back to the FIHS scores by construction, nor does any central claim rest on a self-citation chain or fitted parameter renamed as prediction. The evaluation is empirical across multiple tasks and harm levels, with FIHS operating independently of the final safety metrics. This matches the default expectation of a non-circular derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: gradient-based signals computed on a data point during finetuning correlate with the degree of safety misalignment the point will induce.
invented entities (1)
- Finetuning Implicit Harmfulness Score (FIHS): no independent evidence