
arxiv: 2505.15404 · v2 · submitted 2025-05-21 · 💻 cs.CL

How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study

Pith reviewed 2026-05-22 14:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords large reasoning models · safety enhancement · supervised fine-tuning · data distillation · risky patterns · reasoning processes · ablation study

The pith

Addressing five key risky patterns during data distillation substantially improves safety in Large Reasoning Models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large Reasoning Models achieve strong results on tasks like mathematics and programming, yet their reasoning power does not automatically produce safer behavior. Direct distillation of safe responses from a model such as DeepSeek-R1 yields little safety gain because five specific risky patterns appear in the process. Explicitly correcting those patterns during distillation produces clear safety improvements. The study also finds that long and complex reasoning chains are not required, since short or template-based reasoning reaches comparable safety levels. Ablation experiments map how different supervised fine-tuning choices affect the final safety outcome.

Core claim

The authors establish that Large Reasoning Models can be made substantially safer through supervised fine-tuning when five risky patterns are identified and removed from the data distillation pipeline originating from DeepSeek-R1, while showing that lengthy reasoning traces are unnecessary for safety and that shorter or templated reasoning suffices.

What carries the argument

The five key risky patterns identified in the data distillation process for supervised fine-tuning of Large Reasoning Models.

If this is right

  • Safety performance improves substantially once the five risky patterns are explicitly handled in the distillation data.
  • Short or template-based reasoning processes deliver safety results comparable to those from long complex reasoning.
  • Ablation results show that specific training configuration choices directly influence achieved safety levels.
  • Safety gains from this approach do not require increasing the length or complexity of the model's reasoning.
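
To make the "template-based reasoning" idea concrete, here is a minimal sketch of how a short, fixed-shape reasoning trace might be attached to an SFT example. The template wording, the `SftExample` type, and `build_templated_example` are hypothetical illustrations; this summary does not give the paper's actual template. The `<think>` tags follow the output convention of DeepSeek-R1-style models.

```python
from dataclasses import dataclass

# Hypothetical template: the paper's real wording is not stated in this summary.
TEMPLATE = (
    "<think>\n"
    "The request concerns: {topic}. "
    "Assessment: {assessment}. "
    "Plan: {plan}.\n"
    "</think>\n"
    "{answer}"
)


@dataclass
class SftExample:
    """One supervised fine-tuning pair (assumed shape, for illustration)."""
    prompt: str
    completion: str


def build_templated_example(prompt: str, topic: str, harmful: bool, answer: str) -> SftExample:
    """Wrap a final answer in a short, fixed-structure reasoning trace."""
    assessment = "potentially harmful" if harmful else "benign"
    plan = "decline and explain briefly" if harmful else "answer helpfully"
    completion = TEMPLATE.format(
        topic=topic, assessment=assessment, plan=plan, answer=answer
    )
    return SftExample(prompt=prompt, completion=completion)
```

The point of the sketch is only that the reasoning slot can be a few filled-in fields rather than a long free-form chain; whether such a template matches the one evaluated in the paper is an assumption.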

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety training for reasoning models may be separable from efforts to strengthen their reasoning capabilities.
  • The same pattern-addressing method could be tested on safety distillation for non-reasoning language models.
  • Safety evaluations may need to inspect reasoning traces themselves rather than only final outputs.

Load-bearing premise

The safety benchmarks and evaluation protocols used accurately capture real-world harmful behaviors and are not overly sensitive to superficial changes in output style.

What would settle it

Observing no safety improvement on held-out harmful queries when the five risky patterns are addressed would falsify the central claim.
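
One way to operationalize this test is a paired bootstrap over held-out queries, sketched below. The function name `paired_bootstrap_diff` and the 0/1 encoding (1 means the response was judged safe) are assumptions for illustration, not part of the paper's released code. If the resulting 95% interval for the difference in safe-response rate includes zero, the "no improvement" outcome cannot be ruled out.

```python
import random
from typing import List, Tuple


def paired_bootstrap_diff(
    baseline: List[int],
    treated: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed difference, 95% CI low, 95% CI high) in safe-response rate.

    baseline[i] and treated[i] are 0/1 safety judgments for the same held-out
    query under the two training setups (hypothetical encoding).
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    observed = (sum(treated) - sum(baseline)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(treated[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return observed, low, high
```

Pairing by query matters here: both models are scored on the same prompts, so resampling pairs removes per-query difficulty from the comparison.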

Original abstract

Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how should we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify five key risky patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using a short or template-based reasoning process can attain comparable safety performance. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we conduct a comprehensive ablation study to reveal the impact of different training configurations. Overall, we hope our empirical study can provide a more holistic picture of how to enhance the safety of LRMs. The code and data used in our experiments are released at https://github.com/thu-coai/LRM-Safety-Study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical study on enhancing the safety of Large Reasoning Models (LRMs) via Supervised Fine-Tuning (SFT). It reports that direct distillation of safe responses from DeepSeek-R1 fails to yield significant safety gains, identifies five key risky patterns in the distillation process as the cause, demonstrates that explicitly mitigating these patterns produces substantial safety improvements, shows that short or template-based reasoning achieves comparable safety performance, and includes ablation studies on training configurations. Code and data are released.

Significance. If the empirical results hold after addressing evaluation controls, the work supplies actionable guidance on curating distillation data for LRM safety and questions whether extended reasoning chains are required for alignment. The public release of code and data strengthens reproducibility.

major comments (3)
  1. [§4 (Evaluation)] The safety benchmarks, judge model, refusal-rate definitions, and controls for output length, formatting, or verbosity are not described in sufficient detail. This is load-bearing for the central claim, as the reported gains from addressing risky patterns and adopting short reasoning could reflect stylistic artifacts rather than reduced harmful propensity.
  2. [§3.1 (Direct Distillation)] The claim that direct distillation 'fails to significantly enhance safety' is presented without quantitative effect sizes, baseline comparisons to other SFT safety methods, or statistical tests, undermining assessment of both the problem magnitude and the success of the five-pattern mitigation.
  3. [§5 (Ablation Study)] The ablation results on training configurations report point estimates without variance across runs, multiple random seeds, or sensitivity checks, leaving the robustness of configuration impacts unclear.
minor comments (2)
  1. [Abstract] The abstract states 'substantial safety improvements' without referencing specific tables or figures that quantify the gains.
  2. [§3.2] Notation for the five risky patterns could be introduced earlier with a summary table for reader convenience.
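
The length control requested in major comment 1 could take the form of a length-stratified safe-response rate, sketched here. The record shape `(response_length_in_tokens, judged_safe)`, the bucket edges, and both function names are assumptions for illustration; the paper's actual judge and metrics are not described in this summary.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_label(length: int, edges: List[int]) -> str:
    """Map a response length to a bucket name given sorted, non-empty edges."""
    i = bisect.bisect_right(edges, length)
    if i == 0:
        return f"<{edges[0]}"
    if i == len(edges):
        return f">={edges[-1]}"
    return f"{edges[i - 1]}-{edges[i] - 1}"


def safe_rate_by_length(
    records: Iterable[Tuple[int, bool]], edges: List[int]
) -> Dict[str, float]:
    """Compute the fraction of responses judged safe within each length bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [safe count, total count]
    for length, judged_safe in records:
        bucket = counts[bucket_label(length, edges)]
        bucket[0] += int(judged_safe)
        bucket[1] += 1
    return {name: safe / total for name, (safe, total) in counts.items()}
```

If a model's safety advantage holds within each length bucket rather than only in aggregate, that is evidence against the gain being an artifact of shorter or longer outputs.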

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the detailed and constructive feedback on our manuscript. The comments have helped us identify areas where we can improve the clarity and rigor of our empirical study on enhancing LRM safety. We address each major comment below, indicating the revisions made to the manuscript.

Point-by-point responses
  1. Referee: §4 (Evaluation): The safety benchmarks, judge model, refusal-rate definitions, and controls for output length, formatting, or verbosity are not described in sufficient detail. This is load-bearing for the central claim, as the reported gains from addressing risky patterns and adopting short reasoning could reflect stylistic artifacts rather than reduced harmful propensity.

    Authors: We agree with the referee that additional details on the evaluation setup are crucial for validating our claims. In the revised manuscript, we have substantially expanded §4 to provide: comprehensive descriptions of the safety benchmarks, including the specific datasets and query types used; the identity and configuration of the judge model along with the exact prompting strategy for safety assessment; clear definitions and calculation methods for refusal rates; and explicit controls for output length, formatting, and verbosity, such as length-matched comparisons and analysis of response styles. These revisions aim to demonstrate that the safety improvements stem from genuine reductions in harmful content rather than superficial changes in output style.

    Revision: yes

  2. Referee: §3.1 (Direct Distillation): The claim that direct distillation 'fails to significantly enhance safety' is presented without quantitative effect sizes, baseline comparisons to other SFT safety methods, or statistical tests, undermining assessment of both the problem magnitude and the success of the five-pattern mitigation.

    Authors: We appreciate this observation. To address it, we have updated §3.1 with quantitative effect sizes showing the safety performance metrics before and after direct distillation, including specific deltas compared to the base LRM. We have also incorporated baseline comparisons to alternative SFT approaches for safety, such as those using curated safety datasets without reasoning components. While we did not perform formal statistical significance tests across multiple runs (due to resource limitations), the consistent trends across different model scales and benchmarks support our conclusions regarding the failure of naive distillation and the effectiveness of the five-pattern mitigation.

    Revision: partial

  3. Referee: §5 (Ablation Study): The ablation results on training configurations report point estimates without variance across runs, multiple random seeds, or sensitivity checks, leaving the robustness of configuration impacts unclear.

    Authors: Thank you for highlighting this aspect of robustness. In the revised §5, we now explicitly state that results are from single runs with a fixed seed due to the substantial computational requirements of fine-tuning LRMs. To mitigate concerns, we have added sensitivity analyses by varying training hyperparameters (e.g., learning rate, epochs) within reasonable ranges and confirming that the relative impacts of configurations remain consistent. We believe this provides sufficient evidence for the reliability of the reported trends in our ablation study.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential predictions

full rationale

This paper conducts an empirical investigation into safety enhancements for Large Reasoning Models via supervised fine-tuning on distilled data. It reports an initial observation that direct distillation from DeepSeek-R1 does not improve safety, identifies five risky patterns through analysis, shows that addressing them yields gains, and finds that short or template-based reasoning achieves comparable results. A final ablation explores training configurations. No equations, first-principles derivations, or predictions appear in the provided text. Claims rest on experimental outcomes rather than reducing to fitted parameters or self-citations by construction. The study releases code and data for external verification, satisfying the criterion for self-contained empirical work against benchmarks. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Claims rest on standard machine-learning assumptions that SFT transfers safety from curated data and that benchmark scores reflect genuine risk reduction; no new free parameters, axioms beyond domain norms, or invented entities appear in the abstract.

axioms (2)
  • domain assumption: Supervised fine-tuning on curated safe responses transfers safety properties to the target model
    Invoked throughout the distillation and training sections as the core mechanism.
  • domain assumption: Existing safety benchmarks provide a reliable proxy for real-world harm
    Used to measure all reported improvements.

pith-pipeline@v0.9.0 · 5805 in / 1360 out tokens · 61309 ms · 2026-05-22T14:02:34.123533+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

    cs.AI · 2025-10 · unverdicted · novelty 7.0

    Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step inter...

  2. Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.

  3. Reasoning Structure Matters for Safety Alignment of Reasoning Models

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

  4. Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    PRJA achieves 83.6% average success injecting harmful content into LRM reasoning chains on five QA datasets without altering final answers.