AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Hailiang Dai; Jiayu Yao; Jigang Wang; Kunpeng Ning; Li Yuan; Qihui Zhang; Shuo Yang; Xiaojun Jia; Yibing Song; Yuyang Liu

arxiv: 2506.08473 · v4 · pith:5MN5WHBGnew · submitted 2025-06-10 · 💻 cs.LG

Shuo Yang, Qihui Zhang, Yuyang Liu, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan

keywords: safety, asft, fine-tuning, alignment, basin, direction, models, narrow

Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction - defined by weight differences between aligned (safe) and unaligned models - rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing the parameter space as a "narrow safety basin". To address this, we propose AsFT (Anchoring Safety in Fine-Tuning) to maintain safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively constrains the model within the "narrow safety basin," thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
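The constraint described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes weights are treated as flattened vectors, that the alignment direction is the normalized difference between aligned and unaligned weights, and that the penalty is the squared norm of the update's orthogonal component scaled by a hypothetical coefficient `lam`. All function names are illustrative.

```python
import numpy as np


def alignment_direction(w_aligned, w_unaligned):
    """Unit vector along the flattened weight difference (aligned - unaligned)."""
    d = (w_aligned - w_unaligned).ravel()
    norm = np.linalg.norm(d)
    if norm == 0.0:
        raise ValueError("aligned and unaligned weights are identical")
    return d / norm


def split_update(delta_w, d_hat):
    """Split an update into components parallel and orthogonal to d_hat."""
    flat = delta_w.ravel()
    parallel = np.dot(flat, d_hat) * d_hat  # projection onto the direction
    orthogonal = flat - parallel            # remainder leaves the direction
    return parallel.reshape(delta_w.shape), orthogonal.reshape(delta_w.shape)


def orthogonal_penalty(delta_w, d_hat, lam=1.0):
    """Assumed regularizer: lam * squared norm of the orthogonal component."""
    _, orthogonal = split_update(delta_w, d_hat)
    return lam * float(np.sum(orthogonal ** 2))


def regularized_loss(task_loss, delta_w, d_hat, lam=1.0):
    """Task loss plus the penalty on movement away from the alignment direction."""
    return task_loss + orthogonal_penalty(delta_w, d_hat, lam)
```

Under these assumptions, an update that moves only along the alignment direction incurs no penalty, while any orthogonal movement is penalized quadratically. In practice the update would be a fine-tuning delta such as a LoRA product, and the penalty would be computed per layer inside an autodiff framework.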

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
cs.AI · 2026-04 · unverdicted · novelty 6.0

Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
cs.CV · 2025-10 · unverdicted · novelty 6.0

UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.