arXiv preprint arXiv:2411.09003
5 Pith papers cite this work.
Citing papers by year: 2026 (5)
citing papers explorer
- Low-Resource Safety Failures Are Action Failures, Not Representation Failures
  Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
- Expert-Aware Refusal Steering
  Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.
- RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
  RefusalGuard constrains updates in hidden representation space to preserve safety-relevant geometric structure during fine-tuning, maintaining low attack success rates on safety benchmarks while preserving task performance.
- Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
  Geometric Unlearning distills a low-rank safe subspace from reference prompts and applies projection-based alignment on synthetic anchors to suppress target content while preserving non-target utility.
- Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
  Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.