Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.
Rsafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards. CoRR, abs/2506.07736
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.AI (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Internalizing Safety Understanding in Large Reasoning Models via Verification
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.