Benign fine-tuning drifts LLM parameters toward danger directions; SQSD scores each sample by the projection difference of its induced update onto safety versus danger vectors.
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
Empirical experiments show helpfulness-domain post-training (SFT and GRPO) degrades animal compassion values on ANIMA benchmark more than coding-domain training, with partial transfer to English moral reasoning but not multilingual.
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.
Converts impossibility theorems into architecture-dependent accuracy ceilings and design rules for transformers and other AI subfields, with the Deterministic Horizon measured at 19-31 across twelve models.
citing papers explorer
-
From Parameter Dynamics to Risk Scoring : Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
Benign fine-tuning drifts LLM parameters toward danger directions; SQSD scores each sample by the projection difference of its induced update onto safety versus danger vectors.
-
Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training
Empirical experiments show helpfulness-domain post-training (SFT and GRPO) degrades animal compassion values on ANIMA benchmark more than coding-domain training, with partial transfer to English moral reasoning but not multilingual.
-
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.
-
The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
Converts impossibility theorems into architecture-dependent accuracy ceilings and design rules for transformers and other AI subfields, with the Deterministic Horizon measured at 19-31 across twelve models.