pith. machine review for the scientific record. sign in

arxiv: 2605.14194 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

GradShield: Alignment Preserving Finetuning

Authors on Pith no claims yet

Pith reviewed 2026-05-15 04:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords alignment preservationfinetuning safetyharmful data filteringLLM safetyimplicit harmfulnessadaptive thresholding
0
0 comments X

The pith

GradShield filters data points by their implicit harmfulness score to keep finetuned LLMs aligned while retaining utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a filtering technique that scores each finetuning example for how likely it is to push the model toward unsafe behaviors, even when the data looks benign. It then applies an adaptive threshold to drop the riskiest examples before training starts. This keeps the attack success rate low across multiple tasks and levels of contamination without hurting the model's performance on intended objectives. A reader would care because finetuning often erodes the safety properties built into base models, and a lightweight pre-filter offers a direct way to protect alignment during adaptation.

Core claim

GradShield computes a Finetuning Implicit Harmfulness Score (FIHS) for every data point and removes those above an adaptive threshold, producing models whose attack success rate stays below 6 percent on safety benchmarks while utility metrics remain comparable to models trained on unfiltered data.

What carries the argument

The Finetuning Implicit Harmfulness Score (FIHS), which quantifies the expected contribution of a single training example to post-finetuning misalignment.

If this is right

  • Finetuning runs can include more diverse data sources without separate safety post-processing steps.
  • The same scoring approach can be applied to different base models and task distributions.
  • Utility degradation from aggressive filtering is avoided because the threshold adapts to the observed score distribution.
  • Models remain resistant to both explicit and implicit harmful examples in the training mix.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be inserted as a preprocessing stage before any existing alignment or safety-tuning pipeline.
  • Extending the score to multimodal or instruction-following data might protect against new forms of misalignment.
  • If the score correlates with gradient directions that increase harmful output probability, it might generalize to other optimization objectives beyond standard language modeling.

Load-bearing premise

That the FIHS value for a data point reliably predicts whether including it will increase the model's vulnerability to attacks after finetuning.

What would settle it

A controlled experiment that adds only high-FIHS examples to a safe dataset and measures whether the resulting model shows no rise in attack success rate compared with the unfiltered baseline.

Figures

Figures reproduced from arXiv: 2605.14194 by Basel Alomair, David Wagner, Emad A. Alghamdi, Patrick Mendoza, Raluca Ada Popa, Xiao Huang, Zhanhao Hu.

Figure 1
Figure 1. Figure 1: GradShield is well-suited for defending API finetuning. It protects the safety alignment of [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of FIHS scores of utility and harmfulness datasets. (a) FIHS scores. (b) [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GradShield, a filtering method for LLM finetuning that computes a Finetuning Implicit Harmfulness Score (FIHS) per data point via gradient signals from a proxy step and applies adaptive thresholding to remove points likely to cause misalignment. It evaluates the approach on multiple utility finetuning tasks with varying levels of harmful data, claiming that the resulting models achieve ASR below 6% while preserving utility and outperforming baselines.

Significance. If the central claim holds, GradShield would offer a practical, gradient-based safeguard for maintaining alignment during finetuning, addressing a real vulnerability in post-training safety. The empirical results across harmful-data regimes and the focus on both safety and utility metrics would make it a useful contribution to the alignment literature, particularly if the method proves more precise than volume-based filtering.

major comments (3)
  1. [Section 3] Section 3 (FIHS definition and computation): The manuscript provides no equations, pseudocode, or implementation details for how the gradient signals are aggregated into the FIHS score or for the proxy finetuning step used to compute it. Without these, it is impossible to verify whether the reported ASR reductions are produced by the claimed mechanism or by an unstated hyperparameter choice.
  2. [Section 4] Section 4 (Experimental evaluation): No ablation is reported that compares GradShield against random or size-matched filtering at equivalent data-retention rates. The central claim that FIHS specifically identifies harmful points therefore rests only on post-filtering aggregate ASR; the observed <6% ASR could be explained by aggressive data reduction rather than accurate harm detection.
  3. [Section 4.2] Section 4.2 (Correlation analysis): The paper reports only aggregate safety and utility metrics across harmful-data levels and does not provide per-point correlation between FIHS scores and actual misalignment risk (e.g., via controlled retention experiments). This leaves the weakest assumption—that FIHS ranks data by misalignment potential—without direct empirical support.
minor comments (2)
  1. [Abstract] The abstract and introduction use the term 'parameter-free' for the thresholding step, but the adaptive algorithm description implies at least one tunable sensitivity parameter; clarify the exact claim.
  2. [Figure 2] Figure 2 (ASR vs. harmful-data ratio) lacks error bars or statistical significance tests across the multiple runs; add these to support the 'consistently below 6%' claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity and empirical rigor. We will revise the manuscript to address each point as detailed below.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (FIHS definition and computation): The manuscript provides no equations, pseudocode, or implementation details for how the gradient signals are aggregated into the FIHS score or for the proxy finetuning step used to compute it. Without these, it is impossible to verify whether the reported ASR reductions are produced by the claimed mechanism or by an unstated hyperparameter choice.

    Authors: We agree that the current version lacks the necessary implementation details for full reproducibility and verification. In the revised manuscript, we will add the full mathematical definition of the Finetuning Implicit Harmfulness Score (FIHS), including the gradient computation from the proxy finetuning step, the aggregation formula across tokens or layers, the adaptive thresholding procedure, and accompanying pseudocode. This will explicitly show how the score is derived and allow independent verification of the mechanism. revision: yes

  2. Referee: [Section 4] Section 4 (Experimental evaluation): No ablation is reported that compares GradShield against random or size-matched filtering at equivalent data-retention rates. The central claim that FIHS specifically identifies harmful points therefore rests only on post-filtering aggregate ASR; the observed <6% ASR could be explained by aggressive data reduction rather than accurate harm detection.

    Authors: We acknowledge that the absence of a direct comparison to random or size-matched filtering at matched retention rates leaves open the possibility that safety gains stem from data volume reduction alone. We will add this ablation in the revised Section 4, evaluating GradShield against random sampling and size-matched baselines at identical retention rates across all harmful-data regimes. The results will be reported with the same ASR and utility metrics to isolate the contribution of the FIHS-based selection. revision: yes

  3. Referee: [Section 4.2] Section 4.2 (Correlation analysis): The paper reports only aggregate safety and utility metrics across harmful-data levels and does not provide per-point correlation between FIHS scores and actual misalignment risk (e.g., via controlled retention experiments). This leaves the weakest assumption—that FIHS ranks data by misalignment potential—without direct empirical support.

    Authors: We agree that aggregate metrics alone do not directly validate the per-point ranking assumption. In the revision, we will include controlled retention experiments that sort data by FIHS score and measure ASR as a function of the retention threshold (e.g., retaining the lowest-FIHS fraction). We will also report Spearman or Pearson correlation between FIHS values and per-point misalignment indicators where feasible, providing direct evidence that higher FIHS scores correspond to greater misalignment risk. revision: yes

Circularity Check

0 steps flagged

No significant circularity in GradShield's FIHS-based filtering

full rationale

The paper defines FIHS via gradient-based computation on a proxy finetuning step and applies adaptive thresholding to remove points before the main finetuning. No equation or step reduces the reported ASR (<6%) or utility metrics back to the FIHS scores by construction, nor does any central claim rest on a self-citation chain or fitted parameter renamed as prediction. The evaluation is empirical across multiple tasks and harm levels, with FIHS operating independently of the final safety metrics. This matches the default expectation of a non-circular derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the unproven premise that gradient signals during early finetuning steps can reliably predict post-training misalignment; no independent evidence for this mapping is supplied in the abstract.

axioms (1)
  • domain assumption Gradient-based signals computed on a data point during finetuning correlate with the degree of safety misalignment the point will induce.
    Invoked to justify the FIHS as a filtering criterion.
invented entities (1)
  • Finetuning Implicit Harmfulness Score (FIHS) no independent evidence
    purpose: Quantify the latent harmfulness of a training example with respect to model alignment.
    New metric introduced by the paper; no external validation or falsifiable prediction supplied.

pith-pipeline@v0.9.0 · 5463 in / 1270 out tokens · 39443 ms · 2026-05-15T04:40:50.095750+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [4]

    2022 , eprint=

    Training language models to follow instructions with human feedback , author=. 2022 , eprint=

  5. [5]

    2023 , eprint=

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , author=. 2023 , eprint=

  6. [6]

    2024 , eprint=

    Robustifying Safety-Aligned Large Language Models through Clean Data Curation , author=. 2024 , eprint=

  7. [7]

    2024 , eprint=

    Representation Noising: A Defence Mechanism Against Harmful Finetuning , author=. 2024 , eprint=

  8. [8]

    2025 , eprint=

    Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation , author=. 2025 , eprint=

  9. [9]

    2024 , eprint=

    Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack , author=. 2024 , eprint=

  10. [10]

    2024 , eprint=

    Fine-tuning can cripple your foundation model; preserving features may be the solution , author=. 2024 , eprint=

  11. [11]

    2024 , eprint=

    Safety Alignment Should Be Made More Than Just a Few Tokens Deep , author=. 2024 , eprint=

  12. [12]

    2024 , eprint=

    Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions , author=. 2024 , eprint=

  13. [13]

    2024 , eprint=

    Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack , author=. 2024 , eprint=

  14. [14]

    2025 , eprint=

    SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation , author=. 2025 , eprint=

  15. [15]

    2024 , eprint=

    Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey , author=. 2024 , eprint=

  16. [16]

    Advances in Neural Information Processing Systems , volume=

    Toxicity detection for free , author=. Advances in Neural Information Processing Systems , volume=

  17. [17]

    2024 , eprint=

    NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning , author=. 2024 , eprint=

  18. [18]

    2025 , eprint=

    Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models , author=. 2025 , eprint=

  19. [19]

    D ialog S um: A Real-Life Scenario Dialogue Summarization Dataset

    Chen, Yulong and Liu, Yang and Chen, Liang and Zhang, Yue. D ialog S um: A Real-Life Scenario Dialogue Summarization Dataset. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. doi:10.18653/v1/2021.findings-acl.449

  20. [20]

    2016 , eprint=

    Character-level Convolutional Networks for Text Classification , author=. 2016 , eprint=

  21. [21]

    2023 , eprint=

    Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. 2023 , eprint=

  22. [22]

    arXiv preprint arXiv:2407.15549 , year=

    Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs , author=. arXiv preprint arXiv:2407.15549 , year=

  23. [23]

    2021 , eprint=

    LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

  24. [24]

    arXiv preprint arXiv:2408.09600 , year=

    Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning , author=. arXiv preprint arXiv:2408.09600 , year=

  25. [25]

    arXiv preprint arXiv:2507.21182 , year=

    SDD: Self-Degraded Defense against Malicious Fine-tuning , author=. arXiv preprint arXiv:2507.21182 , year=

  26. [26]

    2025 , eprint=

    Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency , author=. 2025 , eprint=

  27. [27]

    2025 , eprint=

    Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety , author=. 2025 , eprint=

  28. [28]

    2024 , eprint=

    What is in Your Safe Data? Identifying Benign Data that Breaks Safety , author=. 2024 , eprint=

  29. [29]

    2024 , eprint=

    Removing RLHF Protections in GPT-4 via Fine-Tuning , author=. 2024 , eprint=

  30. [30]

    2022 , eprint=

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

  31. [31]

    URL http: //dx.doi.org/10.18653/v1/D19-5409

    Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander. SAMS um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. Proceedings of the 2nd Workshop on New Frontiers in Summarization. 2019. doi:10.18653/v1/D19-5409

  32. [32]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord , title =. arXiv:1803.05457v1 , year =

  33. [33]

    arXiv preprint arXiv:2410.10014 , year=

    Safety-aware fine-tuning of large language models , author=. arXiv preprint arXiv:2410.10014 , year=

  34. [34]

    2022 , eprint=

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned , author=. 2022 , eprint=

  35. [35]

    2024 , eprint=

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. 2024 , eprint=

  36. [36]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  37. [37]

    2024 , eprint=

    GPT-4 Technical Report , author=. 2024 , eprint=

  38. [38]

    2023 , eprint=

    PAL: Program-aided Language Models , author=. 2023 , eprint=

  39. [39]

    2023 , eprint=

    Is ChatGPT a General-Purpose Natural Language Processing Task Solver? , author=. 2023 , eprint=

  40. [40]

    2024 , eprint=

    Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation , author=. 2024 , eprint=

  41. [41]

    2024 , eprint=

    Safety-Aware Fine-Tuning of Large Language Models , author=. 2024 , eprint=

  42. [42]

    2024 , eprint=

    SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection , author=. 2024 , eprint=

  43. [43]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  44. [44]

    Hashimoto , title =

    Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

  45. [45]

    Qwen2.5: A Party of Foundation Models , url =

    Qwen Team , month =. Qwen2.5: A Party of Foundation Models , url =

  46. [46]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    LLaMA 2: Open Foundation and Fine-Tuned Chat Models , author =. arXiv , year =. doi:10.48550/arXiv.2307.09288 , url =

  47. [47]

    arXiv , year =

    The LLaMA 3 Herd of Models , author =. arXiv , year =

  48. [48]

    Advances in Neural Information Processing Systems , volume=

    Estimating training data influence by tracing gradient descent , author=. Advances in Neural Information Processing Systems , volume=

  49. [49]

    International conference on machine learning , pages=

    Understanding black-box predictions via influence functions , author=. International conference on machine learning , pages=. 2017 , organization=

  50. [50]

    arXiv preprint arXiv:2409.19998 , volume=

    Do influence functions work on large language models , author=. arXiv preprint arXiv:2409.19998 , volume=

  51. [51]

    2021 , eprint=

    Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=