Where Do Reasoning Models Refuse?

2025 · cs.CL · arXiv 2507.03167

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Chat models without chain-of-thought (CoT) reasoning must decide whether to refuse a harmful request before generating their first response token. Reasoning models, by contrast, produce extended chains of thought before their final output, raising a natural question: where in this process does the decision to refuse occur? We investigate this across four open-source reasoning models. We first show that the CoT causally influences refusal outcomes; fixing a specific reasoning trace substantially reduces variance in whether the model ultimately refuses or complies. Zooming into the reasoning trace, we find that in distilled models, subtle differences in the opening sentence of the CoT can fully determine the model's refusal decision, and that these patterns transfer across models distilled from the same teacher. Finally, we extract linear refusal directions from model activations and show that ablating them increases harmful compliance, though less reliably than the same technique achieves on non-reasoning models, and with non-negligible degradation to general capabilities.
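The last step of the abstract, extracting a linear refusal direction and ablating it, can be illustrated with a minimal sketch. The abstract does not say how the directions are computed, so the sketch assumes a difference-of-means estimate, which is one common choice in the refusal-direction literature and may not be the authors' method. The activations are synthetic stand-ins, and the names `refusal_direction` and `ablate_direction` are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Assumption: a difference-of-means estimate; the abstract does not
    specify the extraction method.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove the component of each activation row along `direction`."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-ins for per-prompt activations at one layer (not real data).
rng = np.random.default_rng(0)
d_model = 16
harmless = rng.normal(size=(32, d_model))
shift = np.zeros(d_model)
shift[0] = 3.0  # pretend harmful prompts are offset along one axis
harmful = rng.normal(size=(32, d_model)) + shift

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)
```

After ablation, every row of `ablated` has zero projection onto `r_hat`. In a real model the same projection would be applied to hidden states during the forward pass, and the abstract reports that doing so raises harmful compliance less reliably on reasoning models than on non-reasoning ones.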

representative citing papers

Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Activation-level consistency training (ACT) yields a robust defense against adaptive jailbreaks in reasoning models by aligning internal activations on clean and wrapped prompts, outperforming output-level variants.

