Recognition: 2 theorem links · Lean Theorem
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
Pith reviewed 2026-05-15 09:30 UTC · model grok-4.3
The pith
Large reasoning models regain safety when they are trained to decide on safety before starting chain-of-thought reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety degradation in LRMs appears only after CoT generation is enabled. Extracting safety decision signals from a CoT-disabled safe model with a BERT classifier and adding them as auxiliary supervision allows safety gradients to reach the LRM's latent representations, strengthening safety decision-making before CoT starts and thereby improving overall safety without loss of reasoning ability.
What carries the argument
A BERT-based classifier pulls safety decision signals from a CoT-disabled model and supplies them as auxiliary supervision, so that safety gradients can reach the LRM's latent states before CoT generation begins.
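To make the supervision path concrete, here is a minimal PyTorch sketch of how an auxiliary loss of this kind could be wired up. It is not the authors' code: the probe head, the choice of the last prompt token as the pre-CoT latent state, the loss weight, and the names (`SafetyProbe`, `alignment_step`, `teacher_signal`) are illustrative assumptions layered on the abstract's description of BCE-based auxiliary supervision.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SafetyProbe(nn.Module):
    """Illustrative head mapping the LRM's pre-CoT latent state to a safety-decision probability."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(latent)).squeeze(-1)

def alignment_step(lrm, probe, batch, teacher_signal, aux_weight=0.5):
    """One training step: standard alignment loss plus BCE against BERT-derived safety signals.

    `teacher_signal` is assumed to be a (batch,) tensor of safety-decision probabilities
    produced by a BERT classifier run on a CoT-disabled model's responses.
    """
    outputs = lrm(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
        output_hidden_states=True,
    )
    lm_loss = outputs.loss  # usual token-level alignment objective

    # Latent state just before any chain-of-thought tokens: the hidden state at the
    # last prompt token (batch["prompt_len"] holds per-example prompt lengths).
    hidden = outputs.hidden_states[-1]                       # (batch, seq, hidden)
    idx = batch["prompt_len"] - 1                            # (batch,)
    pre_cot_latent = hidden[torch.arange(hidden.size(0)), idx]

    # Auxiliary supervision: gradients from this term reach the latent representations.
    p_student = probe(pre_cot_latent)
    aux_loss = F.binary_cross_entropy(p_student, teacher_signal)

    return lm_loss + aux_weight * aux_loss
```

Reading off the last prompt token is only one plausible choice of "pre-CoT latent state"; the paper's actual probe location and loss weighting may differ.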
If this is right
- Safety performance rises substantially on standard benchmarks.
- Reasoning accuracy on math and logic tasks remains at the original level.
- Safety gradients are successfully back-propagated into the model's internal representations.
- The timing of safety decisions can be shifted earlier in the generation process without new side effects.
Where Pith is reading between the lines
- The same early-decision idea could be tested on other reasoning techniques such as tree-of-thought or tool use.
- Models might be trained from the start with an explicit pre-reasoning safety gate rather than added after the fact.
- If the signals prove stable across model families, the method could become a standard pre-training step for any reasoning-enhanced system.
Load-bearing premise
Safety signals taken from a model that never runs chain-of-thought accurately capture the decisions the full model needs to make before it starts reasoning.
What would settle it
Train the same models with the BERT signals replaced by random labels and check whether safety scores return to the original degraded baseline while reasoning scores stay flat.
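A minimal sketch of that control, assuming a harness like the `alignment_step` sketch above; `make_teacher_signal` is a hypothetical helper, and the only change under `control=True` is that the BERT-derived signals are replaced by uninformative Bernoulli(0.5) labels.

```python
import torch

def make_teacher_signal(bert_signal: torch.Tensor, control: bool) -> torch.Tensor:
    """Return either the BERT-derived safety signals or random labels of the same shape."""
    if control:
        return torch.bernoulli(torch.full_like(bert_signal, 0.5))
    return bert_signal

# Hypothetical usage inside the training loop:
#   signal = make_teacher_signal(bert_signal_for(batch), control=True)
#   loss = alignment_step(lrm, probe, batch, signal)
# If safety scores fall back to the degraded baseline under control=True while
# reasoning scores stay flat, the content of the signals, not the extra loss
# term alone, is doing the work.
```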
Original abstract
Large reasoning models (LRMs) achieved remarkable performance via chain-of-thought (CoT), but recent studies showed that such enhanced reasoning capabilities are at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs' safety degradation occurs only after CoT is enabled, and this degradation is not observed when CoT is disabled. This observation motivates us to consider encouraging LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before starting CoT generation. Specifically, we first utilize a Bert-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs' safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to the LRMs' latent representations, effectively strengthening the LRMs' safety decision-making abilities against CoT generation. Extensive experiments demonstrate that our method substantially improves the safety capabilities of LRMs while effectively maintaining LRMs' general reasoning performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes that safety degradation in large reasoning models (LRMs) occurs specifically after chain-of-thought (CoT) is enabled and is absent when CoT is disabled. It proposes a safety alignment method that trains a BERT classifier on a CoT-disabled safe model to extract safety decision signals, then incorporates these signals as auxiliary supervision during LRM alignment to strengthen pre-CoT safety decision-making. The central claim is that this approach substantially improves LRM safety while preserving general reasoning performance.
Significance. If the transfer of safety signals proves robust, the work would address a documented trade-off between reasoning capability and safety in LRMs by intervening at the latent decision stage before CoT generation. This could inform more targeted alignment strategies that avoid broad refusal degradation, provided the method generalizes beyond the specific models and datasets tested.
major comments (2)
- [Methods (auxiliary supervision step)] The core assumption that BERT-extracted safety signals from a CoT-disabled model faithfully encode the latent states needed before CoT generation in the target LRM is load-bearing but untested directly. No classifier accuracy on CoT-enabled traces, no ablation of the auxiliary loss weight, and no pre/post-supervision activation comparisons are reported, leaving open the possibility that the signals capture only surface refusal patterns orthogonal to the documented CoT-induced degradation (a sketch of the missing classifier check follows this list).
- [Experiments] The experimental section asserts substantial safety gains with maintained reasoning but supplies no quantitative details on safety metrics (e.g., harm scores, refusal rates), baselines, or ablation studies in the provided description. Without these, the effect size and robustness of the headline claim cannot be evaluated.
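One way the missing classifier check could look, as a hedged sketch rather than the paper's protocol: score the trained BERT safety classifier on responses generated with CoT enabled and compare against reference labels. `safety_clf.predict_proba`, `cot_responses`, and `gold_labels` are placeholder names, not the paper's interface.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_on_cot_traces(safety_clf, cot_responses, gold_labels):
    """cot_responses: CoT-enabled model outputs; gold_labels: 0/1 reference safety decisions."""
    probs = [safety_clf.predict_proba(text) for text in cot_responses]  # P(safe decision)
    preds = [int(p >= 0.5) for p in probs]
    return {
        "accuracy": accuracy_score(gold_labels, preds),
        "auroc": roc_auc_score(gold_labels, probs),
    }
```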
minor comments (2)
- Clarify the exact training procedure and dataset for the BERT classifier, including how safety labels are obtained from the CoT-disabled model, to support reproducibility.
- Add explicit comparison tables showing safety and reasoning metrics before and after the proposed alignment, with statistical significance where appropriate.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our methods and results. We address each major comment below and will revise the manuscript to incorporate additional analyses and clarifications.
Point-by-point responses
- Referee: The core assumption that BERT-extracted safety signals from a CoT-disabled model faithfully encode the latent states needed before CoT generation in the target LRM is load-bearing but untested directly. No classifier accuracy on CoT-enabled traces, ablation of auxiliary loss weight, or pre/post-supervision activation comparisons are reported, leaving open the possibility that the signals capture only surface refusal patterns orthogonal to the documented CoT-induced degradation.
Authors: We agree that direct empirical validation of the signals' alignment with pre-CoT latent states would make the claims more robust. In the revised version, we will add: (1) the BERT classifier's accuracy when evaluated on CoT-enabled reasoning traces from the target LRM; (2) an ablation study varying the weight of the auxiliary supervision loss; and (3) comparisons of hidden-state activations before and after applying the auxiliary supervision to show influence on early decision layers. We maintain that the signals are not limited to surface patterns, as they are extracted from a CoT-disabled safe model where safety remains intact precisely at the pre-CoT stage identified in our core observation; however, the requested experiments will provide stronger evidence. (revision: yes)
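A sketch of how the promised loss-weight ablation could be organized; `train_fn`, `eval_safety`, and `eval_reasoning` are placeholder callables, and the weight grid is illustrative rather than taken from the paper.

```python
def run_aux_weight_ablation(base_lrm, train_fn, eval_safety, eval_reasoning,
                            weights=(0.0, 0.1, 0.5, 1.0, 2.0)):
    """Sweep the auxiliary-loss weight; weight 0.0 recovers the vanilla alignment baseline.

    train_fn(model, aux_weight) -> aligned model; eval_safety / eval_reasoning return
    scalar scores (e.g., refusal rate on safety benchmarks, math-benchmark accuracy).
    """
    results = {}
    for w in weights:
        model = train_fn(base_lrm, aux_weight=w)
        results[w] = {"safety": eval_safety(model), "reasoning": eval_reasoning(model)}
    return results
```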
- Referee: The experimental section asserts substantial safety gains with maintained reasoning but supplies no quantitative details on safety metrics (e.g., harm scores, refusal rates), baselines, or ablation studies in the provided description. Without these, the effect size and robustness of the headline claim cannot be evaluated.
Authors: The full manuscript contains quantitative results on standard safety benchmarks, including harm scores and refusal rates, with comparisons against baselines such as vanilla safety fine-tuning and CoT-disabled variants. To address the concern, we will expand the experimental section in the revision to prominently feature these metrics, effect sizes, and additional ablation studies isolating the contribution of the auxiliary supervision. This will allow clearer evaluation of the method's impact on safety versus reasoning trade-offs. (revision: yes)
Circularity Check
No significant circularity in derivation chain
full rationale
The paper describes a training procedure that extracts safety signals from a separate CoT-disabled model via an external BERT classifier and uses them as auxiliary supervision during alignment. No equations, derivations, or self-citations are shown that reduce the claimed safety improvement to a fitted parameter or to a definition of the target result itself. The method is grounded in external benchmarks and independently trained components (the classifier is trained separately; standard alignment losses are used) and does not match any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Safety degradation in LRMs occurs only after CoT is enabled.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: "we first utilize a Bert-based classifier to extract safety decision signals from a safe model ... integrate these signals into LRMs' safety alignment as auxiliary supervision ... $\mathcal{L}_{\text{align}}(\theta, \phi) = \frac{1}{M} \sum_{j=1}^{M} \ell_{\text{BCE}}(p_j, p'_j)$"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: "safety degradation occurs only after CoT is enabled ... promote safety decision-making before CoT generation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [2] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- [3] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [4] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.
- [5] Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, and Lei Sha. Reasoning-to-defend: Safety-aware reasoning can defend large language models from jailbreaking. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [6] Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2024, 2024.
- [7] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
- [8] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- [9] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- [10] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
- [11] Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. H-CoT: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. arXiv preprint arXiv:2502.12893.
- [12] Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, and Fazl Barez. Chain-of-thought hijacking. arXiv preprint arXiv:2510.26418.
- [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
- [14] Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork Open Reasoner 1 technical report. arXiv preprint arXiv:2505.22312.
- [15] Alexander Robey, Eric Wong, Hamed Hassani, and George Pappas. SmoothLLM: Defending large language models against jailbreaking attacks. Transactions on Machine Learning Research, 2025.
- [16] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
- [17] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
- [18] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- [19] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, and others. DeepSeek-V3.
- [20] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.