arXiv: 2606.03648 · v1 · pith:7NQFGQJO · submitted 2026-06-02 · cs.CL · cs.AI

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

keywords: safety, fine-tuning, capability, effects, model, conclusions, fine-tuned, incoherent

Adapting foundation large language models to a user's task or preferred style through fine-tuning can compromise the model's safety. Previous work has examined the effects of fine-tuning on model safety only in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential: it avoids arbitrary empirical choices, allows meaningful conclusions about safety impacts, and lets mitigation methods be compared on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior, measuring capability as well as safety. Our results surface three important issues: (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark and of safety evaluator.
