Activation-level consistency training (ACT) yields a robust defense against adaptive jailbreaks in reasoning models by aligning internal activations on clean and wrapped prompts, outperforming output-level variants.
Where Do Reasoning Models Refuse?
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Chat models without chain-of-thought (CoT) reasoning must decide whether to refuse a harmful request before generating their first response token. Reasoning models, by contrast, produce extended chains of thought before their final output, raising a natural question: where in this process does the decision to refuse occur? We investigate this across four open-source reasoning models. We first show that the CoT causally influences refusal outcomes; fixing a specific reasoning trace substantially reduces variance in whether the model ultimately refuses or complies. Zooming into the reasoning trace, we find that in distilled models, subtle differences in the opening sentence of the CoT can fully determine the model's refusal decision, and that these patterns transfer across models distilled from the same teacher. Finally, we extract linear refusal directions from model activations and show that ablating them increases harmful compliance, though less reliably than the same technique achieves on non-reasoning models, and with non-negligible degradation to general capabilities.
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training
Activation-level consistency training (ACT) yields a robust defense against adaptive jailbreaks in reasoning models by aligning internal activations on clean and wrapped prompts, outperforming output-level variants.