Recognition: no theorem link
GradShield: Alignment Preserving Finetuning
Pith reviewed 2026-05-15 04:40 UTC · model grok-4.3
The pith
GradShield filters data points by their implicit harmfulness score to keep finetuned LLMs aligned while retaining utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GradShield computes a Finetuning Implicit Harmfulness Score (FIHS) for every data point and removes those above an adaptive threshold, producing models whose attack success rate stays below 6 percent on safety benchmarks while utility metrics remain comparable to those of models trained on unfiltered data.
What carries the argument
The Finetuning Implicit Harmfulness Score (FIHS), which quantifies the expected contribution of a single training example to post-finetuning misalignment.
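The paper does not publish the FIHS formula (a point the referee report below presses on), so what follows is only a minimal sketch under stated assumptions: FIHS is taken to be the alignment between an example's loss gradient and a reference "harmful" gradient direction estimated from known-harmful data, and the threshold adapts to the observed score distribution through a robust statistic. Every name here (compute_fihs, adaptive_threshold, harmful_ref_grad, loss_fn) is a hypothetical placeholder, not the authors' implementation.

    # Minimal sketch: gradient-alignment scoring with an adaptive threshold (assumed form, not the paper's).
    import torch
    import torch.nn.functional as F

    def flat_grad(model, loss):
        """Loss gradient over all trainable parameters, flattened into one vector."""
        params = [p for p in model.parameters() if p.requires_grad]
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.reshape(-1) for g in grads])

    def compute_fihs(model, loss_fn, example, harmful_ref_grad):
        """Hypothetical FIHS: cosine alignment of this example's gradient with a harmful reference direction."""
        loss = loss_fn(model, example)          # loss_fn is a placeholder for the training objective
        g = flat_grad(model, loss)
        return F.cosine_similarity(g, harmful_ref_grad, dim=0).item()

    def adaptive_threshold(scores, k=1.5):
        """One plausible adaptive rule: flag scores far above the bulk of the distribution.
        k is an explicit sensitivity knob; the referee questions whether this step is truly parameter-free."""
        s = torch.tensor(scores)
        median = s.median()
        mad = (s - median).abs().median()       # median absolute deviation
        return (median + k * mad).item()

    def filter_dataset(model, loss_fn, dataset, harmful_ref_grad):
        scores = [compute_fihs(model, loss_fn, ex, harmful_ref_grad) for ex in dataset]
        tau = adaptive_threshold(scores)
        kept = [ex for ex, s in zip(dataset, scores) if s <= tau]   # drop high-FIHS points
        return kept, scores, tau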
If this is right
- Finetuning runs can include more diverse data sources without separate safety post-processing steps.
- The same scoring approach can be applied to different base models and task distributions.
- Utility degradation from aggressive filtering is avoided because the threshold adapts to the observed score distribution.
- Models remain resistant to both explicitly and implicitly harmful examples in the training mix.
Where Pith is reading between the lines
- The method could be inserted as a preprocessing stage before any existing alignment or safety-tuning pipeline.
- Extending the score to multimodal or instruction-following data might protect against new forms of misalignment.
- If the score correlates with gradient directions that increase harmful output probability, it might generalize to other optimization objectives beyond standard language modeling.
Load-bearing premise
That the FIHS value for a data point reliably predicts whether including it will increase the model's vulnerability to attacks after finetuning.
What would settle it
A controlled experiment that injects only high-FIHS examples into an otherwise safe dataset and checks whether the resulting model's attack success rate rises relative to training on the safe data alone, while injecting the same number of low-FIHS examples leaves it essentially unchanged.
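A minimal sketch of that settling experiment, assuming precomputed FIHS scores for a candidate pool and placeholder finetune and evaluate_asr routines; it illustrates the protocol only, not the paper's code.

    # Hedged sketch: inject the highest-FIHS candidates into a safe set and compare ASR
    # against the safe baseline and against a matched low-FIHS injection.
    # finetune / evaluate_asr are placeholder training and safety-evaluation routines.
    def settling_experiment(base_model, safe_data, candidate_data, scores, n_inject=500):
        ranked = sorted(zip(scores, candidate_data), key=lambda t: t[0], reverse=True)
        high_fihs = [ex for _, ex in ranked[:n_inject]]
        low_fihs = [ex for _, ex in ranked[-n_inject:]]

        asr_safe = evaluate_asr(finetune(base_model, safe_data))
        asr_high = evaluate_asr(finetune(base_model, safe_data + high_fihs))
        asr_low = evaluate_asr(finetune(base_model, safe_data + low_fihs))

        # If FIHS is predictive, asr_high should rise well above asr_safe while asr_low stays close to it.
        return {"safe": asr_safe, "+high_fihs": asr_high, "+low_fihs": asr_low}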
Original abstract
Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GradShield, a filtering method for LLM finetuning that computes a Finetuning Implicit Harmfulness Score (FIHS) per data point via gradient signals from a proxy step and applies adaptive thresholding to remove points likely to cause misalignment. It evaluates the approach on multiple utility finetuning tasks with varying levels of harmful data, claiming that the resulting models achieve ASR below 6% while preserving utility and outperforming baselines.
Significance. If the central claim holds, GradShield would offer a practical, gradient-based safeguard for maintaining alignment during finetuning, addressing a real vulnerability in post-training safety. The empirical results across harmful-data regimes and the focus on both safety and utility metrics would make it a useful contribution to the alignment literature, particularly if the method proves more precise than volume-based filtering.
major comments (3)
- [Section 3] Section 3 (FIHS definition and computation): The manuscript provides no equations, pseudocode, or implementation details for how the gradient signals are aggregated into the FIHS score or for the proxy finetuning step used to compute it. Without these, it is impossible to verify whether the reported ASR reductions are produced by the claimed mechanism or by an unstated hyperparameter choice.
- [Section 4] Section 4 (Experimental evaluation): No ablation is reported that compares GradShield against random or size-matched filtering at equivalent data-retention rates. The central claim that FIHS specifically identifies harmful points therefore rests only on post-filtering aggregate ASR; the observed <6% ASR could be explained by aggressive data reduction rather than accurate harm detection.
- [Section 4.2] Section 4.2 (Correlation analysis): The paper reports only aggregate safety and utility metrics across harmful-data levels and does not provide per-point correlation between FIHS scores and actual misalignment risk (e.g., via controlled retention experiments). This leaves the weakest assumption—that FIHS ranks data by misalignment potential—without direct empirical support.
minor comments (2)
- [Abstract] The abstract and introduction use the term 'parameter-free' for the thresholding step, but the adaptive algorithm description implies at least one tunable sensitivity parameter; clarify the exact claim.
- [Figure 2] Figure 2 (ASR vs. harmful-data ratio) lacks error bars or statistical significance tests across the multiple runs; add these to support the 'consistently below 6%' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity and empirical rigor. We will revise the manuscript to address each point as detailed below.
Point-by-point responses
-
Referee: [Section 3] Section 3 (FIHS definition and computation): The manuscript provides no equations, pseudocode, or implementation details for how the gradient signals are aggregated into the FIHS score or for the proxy finetuning step used to compute it. Without these, it is impossible to verify whether the reported ASR reductions are produced by the claimed mechanism or by an unstated hyperparameter choice.
Authors: We agree that the current version lacks the necessary implementation details for full reproducibility and verification. In the revised manuscript, we will add the full mathematical definition of the Finetuning Implicit Harmfulness Score (FIHS), including the gradient computation from the proxy finetuning step, the aggregation formula across tokens or layers, the adaptive thresholding procedure, and accompanying pseudocode. This will explicitly show how the score is derived and allow independent verification of the mechanism. revision: yes
-
Referee: [Section 4] Section 4 (Experimental evaluation): No ablation is reported that compares GradShield against random or size-matched filtering at equivalent data-retention rates. The central claim that FIHS specifically identifies harmful points therefore rests only on post-filtering aggregate ASR; the observed <6% ASR could be explained by aggressive data reduction rather than accurate harm detection.
Authors: We acknowledge that the absence of a direct comparison to random or size-matched filtering at matched retention rates leaves open the possibility that safety gains stem from data volume reduction alone. We will add this ablation in the revised Section 4, evaluating GradShield against random sampling and size-matched baselines at identical retention rates across all harmful-data regimes. The results will be reported with the same ASR and utility metrics to isolate the contribution of the FIHS-based selection; a minimal sketch of this ablation design appears after these responses. revision: yes
-
Referee: [Section 4.2] Section 4.2 (Correlation analysis): The paper reports only aggregate safety and utility metrics across harmful-data levels and does not provide per-point correlation between FIHS scores and actual misalignment risk (e.g., via controlled retention experiments). This leaves the weakest assumption—that FIHS ranks data by misalignment potential—without direct empirical support.
Authors: We agree that aggregate metrics alone do not directly validate the per-point ranking assumption. In the revision, we will include controlled retention experiments that sort data by FIHS score and measure ASR as a function of the retention threshold (e.g., retaining the lowest-FIHS fraction). We will also report Spearman or Pearson correlation between FIHS values and per-point misalignment indicators where feasible, providing direct evidence that higher FIHS scores correspond to greater misalignment risk; a sketch of this retention sweep and correlation analysis also follows these responses. revision: yes
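For the second response, a minimal sketch of the promised retention-matched ablation: FIHS-filtered, randomly filtered at the same retention count, and unfiltered data are trained with an identical recipe and scored with the same metrics. finetune and evaluate are hypothetical placeholders.

    import random

    def retention_matched_ablation(base_model, dataset, fihs_kept, finetune, evaluate):
        """Isolate FIHS-based selection from sheer data reduction by matching retention counts."""
        k = len(fihs_kept)                              # matched retention count
        random_kept = random.sample(list(dataset), k)

        results = {}
        for name, subset in [("fihs", fihs_kept),
                             ("random", random_kept),
                             ("unfiltered", list(dataset))]:
            model = finetune(base_model, subset)        # identical training recipe for every arm
            results[name] = evaluate(model)             # e.g. {"asr": ..., "utility": ...}
        return results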
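For the third response, a sketch of the retention sweep and rank correlation, assuming a per-point misalignment indicator is available (for example, the ASR change attributable to retaining that point); finetune and evaluate_asr are again placeholders.

    import numpy as np
    from scipy.stats import spearmanr

    def retention_sweep(base_model, dataset, scores, finetune, evaluate_asr,
                        fractions=(0.5, 0.7, 0.9, 1.0)):
        """Retain the lowest-FIHS fraction and track ASR as the threshold loosens."""
        order = np.argsort(scores)                      # lowest FIHS first
        curve = []
        for frac in fractions:
            keep = [dataset[i] for i in order[: int(frac * len(dataset))]]
            curve.append((frac, evaluate_asr(finetune(base_model, keep))))
        return curve                                    # ASR should climb as high-FIHS points enter

    def fihs_risk_correlation(scores, misalignment_indicators):
        """Spearman rank correlation between FIHS and a per-point misalignment indicator."""
        rho, pval = spearmanr(scores, misalignment_indicators)
        return rho, pval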
Circularity Check
No significant circularity in GradShield's FIHS-based filtering
Full rationale
The paper defines FIHS via gradient-based computation on a proxy finetuning step and applies adaptive thresholding to remove points before the main finetuning. No equation or step reduces the reported ASR (<6%) or utility metrics back to the FIHS scores by construction, nor does any central claim rest on a self-citation chain or fitted parameter renamed as prediction. The evaluation is empirical across multiple tasks and harm levels, with FIHS operating independently of the final safety metrics. This matches the default expectation of a non-circular derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: gradient-based signals computed on a data point during finetuning correlate with the degree of safety misalignment the point will induce.
invented entities (1)
- Finetuning Implicit Harmfulness Score (FIHS): no independent evidence