Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.
emnlp-main.1082/
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.