Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.
Safety in large reason- ing models: A survey
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
background 1polarities
background 1representative citing papers
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
citing papers explorer
-
Addressing Over-Refusal in LLMs with Competing Rewards
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.