Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

2025 · arXiv 2512.23260

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning performance within 1.5 points.

citing papers explorer

Showing 1 of 1 citing paper.

SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

cs.LG · 2026-04-20 · unverdicted · none · ref 38

SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning performance within 1.5 points.
