Systematic experiments reveal that activation steering trades fluency for concept control and is less effective on instruction-tuned models, that prompting and SFT excel at injection but not removal, and that textual metrics correlate with LLM judges.
arXiv preprint arXiv:2408.06223
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Safe-RULE introduces a reinforcement unlearning defense for offline safe RL that counters data poisoning by removing malicious data influence while preserving task performance and safety.
citing papers explorer
- Safe-RULE: Safe Reinforcement UnLEarning