Recognition: 2 theorem links · Lean Theorem
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
Pith reviewed 2026-05-15 09:30 UTC · model grok-4.3
The pith
Large reasoning models regain safety when they are trained to decide on safety before starting chain-of-thought reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety degradation in LRMs appears only after CoT generation is enabled. Extracting safety decision signals from a CoT-disabled safe model with a BERT classifier and adding them as auxiliary supervision allows safety gradients to reach the LRM's latent representations, strengthening safety decision-making before CoT starts and thereby improving overall safety without loss of reasoning ability.
What carries the argument
A BERT-based classifier pulls safety decision signals from a CoT-disabled model and supplies them as auxiliary supervision, so that safety gradients can reach the LRM's latent states before CoT generation begins.
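To make the supervision path concrete, here is a minimal PyTorch sketch of how an auxiliary loss of this kind could be wired up. It is not the authors' code: the probe head, the choice of the last prompt token as the pre-CoT latent state, the loss weight, and the names (`SafetyProbe`, `alignment_step`, `teacher_signal`) are illustrative assumptions layered on the abstract's description of BCE-based auxiliary supervision.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SafetyProbe(nn.Module):
    """Illustrative head mapping the LRM's pre-CoT latent state to a safety-decision probability."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(latent)).squeeze(-1)

def alignment_step(lrm, probe, batch, teacher_signal, aux_weight=0.5):
    """One training step: standard alignment loss plus BCE against BERT-derived safety signals.

    `teacher_signal` is assumed to be a (batch,) tensor of safety-decision probabilities
    produced by a BERT classifier run on a CoT-disabled model's responses.
    """
    outputs = lrm(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
        output_hidden_states=True,
    )
    lm_loss = outputs.loss  # usual token-level alignment objective

    # Latent state just before any chain-of-thought tokens: the hidden state at the
    # last prompt token (batch["prompt_len"] holds per-example prompt lengths).
    hidden = outputs.hidden_states[-1]                       # (batch, seq, hidden)
    idx = batch["prompt_len"] - 1                            # (batch,)
    pre_cot_latent = hidden[torch.arange(hidden.size(0)), idx]

    # Auxiliary supervision: gradients from this term reach the latent representations.
    p_student = probe(pre_cot_latent)
    aux_loss = F.binary_cross_entropy(p_student, teacher_signal)

    return lm_loss + aux_weight * aux_loss
```

Reading off the last prompt token is only one plausible choice of "pre-CoT latent state"; the paper's actual probe location and loss weighting may differ.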
If this is right
- Safety performance rises substantially on standard benchmarks.
- Reasoning accuracy on math and logic tasks remains at the original level.
- Safety gradients are successfully back-propagated into the model's internal representations.
- The timing of safety decisions can be shifted earlier in the generation process without new side effects.
Where Pith is reading between the lines
- The same early-decision idea could be tested on other reasoning techniques such as tree-of-thought or tool use.
- Models might be trained from the start with an explicit pre-reasoning safety gate rather than added after the fact.
- If the signals prove stable across model families, the method could become a standard pre-training step for any reasoning-enhanced system.
Load-bearing premise
Safety signals taken from a model that never runs chain-of-thought accurately capture the decisions the full model needs to make before it starts reasoning.
What would settle it
Train the same models with the BERT signals replaced by random labels and check whether safety scores return to the original degraded baseline while reasoning scores stay flat.
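A minimal sketch of that control, assuming a harness like the `alignment_step` sketch above; `make_teacher_signal` is a hypothetical helper, and the only change under `control=True` is that the BERT-derived signals are replaced by uninformative Bernoulli(0.5) labels.

```python
import torch

def make_teacher_signal(bert_signal: torch.Tensor, control: bool) -> torch.Tensor:
    """Return either the BERT-derived safety signals or random labels of the same shape."""
    if control:
        return torch.bernoulli(torch.full_like(bert_signal, 0.5))
    return bert_signal

# Hypothetical usage inside the training loop:
#   signal = make_teacher_signal(bert_signal_for(batch), control=True)
#   loss = alignment_step(lrm, probe, batch, signal)
# If safety scores fall back to the degraded baseline under control=True while
# reasoning scores stay flat, the content of the signals, not the extra loss
# term alone, is doing the work.
```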
Original abstract
Large reasoning models (LRMs) achieved remarkable performance via chain-of-thought (CoT), but recent studies showed that such enhanced reasoning capabilities are at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs' safety degradation occurs only after CoT is enabled, and this degradation is not observed when CoT is disabled. This observation motivates us to consider encouraging LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before starting CoT generation. Specifically, we first utilize a Bert-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs' safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to the LRMs' latent representations, effectively strengthening the LRMs' safety decision-making abilities against CoT generation. Extensive experiments demonstrate that our method substantially improves the safety capabilities of LRMs while effectively maintaining LRMs' general reasoning performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes that safety degradation in large reasoning models (LRMs) occurs specifically after chain-of-thought (CoT) is enabled and is absent when CoT is disabled. It proposes a safety alignment method that trains a BERT classifier on a CoT-disabled safe model to extract safety decision signals, then incorporates these signals as auxiliary supervision during LRM alignment to strengthen pre-CoT safety decision-making. The central claim is that this approach substantially improves LRM safety while preserving general reasoning performance.
Significance. If the transfer of safety signals proves robust, the work would address a documented trade-off between reasoning capability and safety in LRMs by intervening at the latent decision stage before CoT generation. This could inform more targeted alignment strategies that avoid broad refusal degradation, provided the method generalizes beyond the specific models and datasets tested.
major comments (2)
- [Methods (auxiliary supervision step)] The core assumption that BERT-extracted safety signals from a CoT-disabled model faithfully encode the latent states needed before CoT generation in the target LRM is load-bearing but untested directly. No classifier accuracy on CoT-enabled traces, no ablation of the auxiliary loss weight, and no pre/post-supervision activation comparisons are reported, leaving open the possibility that the signals capture only surface refusal patterns orthogonal to the documented CoT-induced degradation (a sketch of the missing classifier check follows this list).
- [Experiments] The experimental section asserts substantial safety gains with maintained reasoning but supplies no quantitative details on safety metrics (e.g., harm scores, refusal rates), baselines, or ablation studies in the provided description. Without these, the effect size and robustness of the headline claim cannot be evaluated.
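One way the missing classifier check could look, as a hedged sketch rather than the paper's protocol: score the trained BERT safety classifier on responses generated with CoT enabled and compare against reference labels. `safety_clf.predict_proba`, `cot_responses`, and `gold_labels` are placeholder names, not the paper's interface.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_on_cot_traces(safety_clf, cot_responses, gold_labels):
    """cot_responses: CoT-enabled model outputs; gold_labels: 0/1 reference safety decisions."""
    probs = [safety_clf.predict_proba(text) for text in cot_responses]  # P(safe decision)
    preds = [int(p >= 0.5) for p in probs]
    return {
        "accuracy": accuracy_score(gold_labels, preds),
        "auroc": roc_auc_score(gold_labels, probs),
    }
```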
minor comments (2)
- Clarify the exact training procedure and dataset for the BERT classifier, including how safety labels are obtained from the CoT-disabled model, to support reproducibility.
- Add explicit comparison tables showing safety and reasoning metrics before and after the proposed alignment, with statistical significance where appropriate.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our methods and results. We address each major comment below and will revise the manuscript to incorporate additional analyses and clarifications.
Point-by-point responses
- Referee: The core assumption that BERT-extracted safety signals from a CoT-disabled model faithfully encode the latent states needed before CoT generation in the target LRM is load-bearing but untested directly. No classifier accuracy on CoT-enabled traces, ablation of auxiliary loss weight, or pre/post-supervision activation comparisons are reported, leaving open the possibility that the signals capture only surface refusal patterns orthogonal to the documented CoT-induced degradation.
Authors: We agree that direct empirical validation of the signals' alignment with pre-CoT latent states would make the claims more robust. In the revised version, we will add: (1) the BERT classifier's accuracy when evaluated on CoT-enabled reasoning traces from the target LRM; (2) an ablation study varying the weight of the auxiliary supervision loss; and (3) comparisons of hidden-state activations before and after applying the auxiliary supervision to show influence on early decision layers. We maintain that the signals are not limited to surface patterns, as they are extracted from a CoT-disabled safe model where safety remains intact precisely at the pre-CoT stage identified in our core observation; however, the requested experiments will provide stronger evidence. (revision: yes)
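A sketch of how the promised loss-weight ablation could be organized; `train_fn`, `eval_safety`, and `eval_reasoning` are placeholder callables, and the weight grid is illustrative rather than taken from the paper.

```python
def run_aux_weight_ablation(base_lrm, train_fn, eval_safety, eval_reasoning,
                            weights=(0.0, 0.1, 0.5, 1.0, 2.0)):
    """Sweep the auxiliary-loss weight; weight 0.0 recovers the vanilla alignment baseline.

    train_fn(model, aux_weight) -> aligned model; eval_safety / eval_reasoning return
    scalar scores (e.g., refusal rate on safety benchmarks, math-benchmark accuracy).
    """
    results = {}
    for w in weights:
        model = train_fn(base_lrm, aux_weight=w)
        results[w] = {"safety": eval_safety(model), "reasoning": eval_reasoning(model)}
    return results
```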
- Referee: The experimental section asserts substantial safety gains with maintained reasoning but supplies no quantitative details on safety metrics (e.g., harm scores, refusal rates), baselines, or ablation studies in the provided description. Without these, the effect size and robustness of the headline claim cannot be evaluated.
Authors: The full manuscript contains quantitative results on standard safety benchmarks, including harm scores and refusal rates, with comparisons against baselines such as vanilla safety fine-tuning and CoT-disabled variants. To address the concern, we will expand the experimental section in the revision to prominently feature these metrics, effect sizes, and additional ablation studies isolating the contribution of the auxiliary supervision. This will allow clearer evaluation of the method's impact on safety versus reasoning trade-offs. (revision: yes)
Circularity Check
No significant circularity in derivation chain
full rationale
The paper describes a training procedure that extracts safety signals from a separate CoT-disabled model via an external BERT classifier and uses them as auxiliary supervision during alignment. No equations, derivations, or self-citations are shown that reduce the claimed safety improvement to a fitted parameter or to a definition of the target result itself. The method is grounded in external benchmarks and independently trained components (the classifier is trained separately; standard alignment losses are used) and does not match any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Safety degradation in LRMs occurs only after CoT is enabled.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: "we first utilize a Bert-based classifier to extract safety decision signals from a safe model ... integrate these signals into LRMs' safety alignment as auxiliary supervision ... $\mathcal{L}_{\text{align}}(\theta, \phi) = \frac{1}{M} \sum_{j=1}^{M} \ell_{\text{BCE}}(p_j, p'_j)$"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: "safety degradation occurs only after CoT is enabled ... promote safety decision-making before CoT generation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [2] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- [3] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [4] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.
- [5] Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, and Lei Sha. Reasoning-to-defend: Safety-aware reasoning can defend large language models from jailbreaking. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [6] Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2024, 2024.
- [7] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
- [8] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- [9] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- [10] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
- [11] Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. H-CoT: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. arXiv preprint arXiv:2502.12893.
- [12] Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, and Fazl Barez. Chain-of-thought hijacking. arXiv preprint arXiv:2510.26418.
- [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
- [14] Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork Open Reasoner 1 technical report. arXiv preprint arXiv:2505.22312.
- [15] Alexander Robey, Eric Wong, Hamed Hassani, and George Pappas. SmoothLLM: Defending large language models against jailbreaking attacks. Transactions on Machine Learning Research, 2025.
- [16] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
- [17] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
- [18] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- [19] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, and others. DeepSeek-V3.
- [20] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.