Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

· 2026 · cs.AI · arXiv 2603.17368

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Large reasoning models (LRMs) achieved remarkable performance via chain-of-thought (CoT), but recent studies showed that such enhanced reasoning capabilities are at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs' safety degradation occurs only after CoT is enabled, and this degradation is not observed when CoT is disabled. This observation motivates us to consider encouraging LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before starting CoT generation. Specifically, we first utilize a Bert-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs' safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to the LRMs' latent representations, effectively strengthening the LRMs' safety decision-making abilities against CoT generation. Extensive experiments demonstrate that our method substantially improves the safety capabilities of LRMs while effectively maintaining LRMs' general reasoning performance.

representative citing papers

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

cs.CR · 2026-06-06 · unverdicted · novelty 6.0

POISE is a stealthy skill-poisoning attack achieving 89.3% ASR on Skill-Inject by blending a compressed trigger into contextually appropriate positions in skill bodies, outperforming YAML and random-placement baselines while evading static scanners.

citing papers explorer

Showing 1 of 1 citing paper after filters.

POISE: Position-Aware Undetectable Skill Injection on LLM Agents cs.CR · 2026-06-06 · unverdicted · none · ref 42 · internal anchor
POISE is a stealthy skill-poisoning attack achieving 89.3% ASR on Skill-Inject by blending a compressed trigger into contextually appropriate positions in skill bodies, outperforming YAML and random-placement baselines while evading static scanners.

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

fields

years

verdicts

representative citing papers

citing papers explorer