Safety in large reasoning models: A survey

Cheng Wang, Yue Liu, Baolong Bi, Duzhen Zhang, Zhong-Zhi Li, Yingwei Ma, Yufei He, Shengju Yu, Xinfeng Li, Junfeng Fang, Jiaheng Zhang, Bryan Hooi · 2025 · arXiv 2504.01903

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

cs.AI · 2025-10-24 · unverdicted · novelty 7.0

Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.

Internalizing Safety Understanding in Large Reasoning Models via Verification

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.

Reasoning Structure Matters for Safety Alignment of Reasoning Models

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

cs.AI · 2026-03-18 · unverdicted · novelty 5.0

CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.

citing papers explorer

Showing 5 of 5 citing papers.

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

cs.AI · 2025-10-24 · unverdicted · none · ref 15

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

cs.AI · 2026-05-09 · unverdicted · none · ref 28

Internalizing Safety Understanding in Large Reasoning Models via Verification

cs.AI · 2026-05-09 · unverdicted · none · ref 20

Reasoning Structure Matters for Safety Alignment of Reasoning Models

cs.AI · 2026-04-21 · unverdicted · none · ref 24

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

cs.AI · 2026-03-18 · unverdicted · none · ref 34
