pith. machine review for the scientific record.

arxiv: 2510.21285 · v4 · submitted 2025-10-24 · 💻 cs.AI · cs.CL

Recognition: unknown

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Authors on Pith: no claims yet
classification: 💻 cs.AI · cs.CL
keywords: reasoning, safety, models, LRMs, self-jailbreak, while, existing, failures
0 comments
read the original abstract

Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.
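The abstract gives no implementation details, so the sketch below is illustrative only: one heuristic way to flag the Self-Jailbreak pattern it describes, where a reasoning step recognizes harmful intent but the final answer complies anyway. The phrase lists and function names are hypothetical assumptions, not taken from the paper or from CoG.

```python
# Illustrative sketch, not the paper's method: heuristically flag a
# "Self-Jailbreak" trajectory (harm recognized early, overridden later).
# Cue phrases and function names below are hypothetical.

HARM_RECOGNITION_CUES = ("this request is harmful", "i should not help", "against policy")
REFUSAL_CUES = ("i can't help with that", "i cannot assist")

def recognizes_harm(step: str) -> bool:
    """True if a reasoning step acknowledges the query is harmful."""
    s = step.lower()
    return any(cue in s for cue in HARM_RECOGNITION_CUES)

def is_refusal(answer: str) -> bool:
    """True if the final answer refuses the request."""
    a = answer.lower()
    return any(cue in a for cue in REFUSAL_CUES)

def is_self_jailbreak(reasoning_steps: list[str], final_answer: str) -> bool:
    """Flag trajectories where some step recognizes harm but the final
    answer still complies, i.e. the model overrides its own judgment."""
    return any(recognizes_harm(s) for s in reasoning_steps) and not is_refusal(final_answer)

# Toy usage example:
steps = [
    "The user asks how to synthesize a dangerous substance.",
    "This request is harmful, I should not help.",
    "But maybe it is for a chemistry class, so I will explain anyway.",
]
print(is_self_jailbreak(steps, "Sure, here are the steps..."))  # True
```

A step-level flag like this is only a diagnostic; the paper's CoG framework reportedly intervenes at the level of individual reasoning steps during training rather than filtering outputs after the fact.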

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

cs.AI · 2026-05 · unverdicted · novelty 6.0

    Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.