Shadow alignment: The ease of subverting safely-aligned language models, 2023
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Representative citing papers
- Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
Benign fine-tuning of foundation models induces large, heterogeneous, and often contradictory changes in safety metrics across general and domain-specific benchmarks.
- Persona-Model Collapse in Emergent Misalignment