THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Dongmin Park; Gyeongman Kim; Jihun Yun; Jongho Park; Minki Kang; Sangwoo Park; Seanie Lee; Sung Ju Hwang; Yumin Choi

arxiv: 2601.23143 · v4 · submitted 2026-01-30 · 💻 cs.AI

Pith reviewed 2026-05-16 09:30 UTC · model grok-4.3

keywords: safety alignment, reasoning models, KL divergence, self-generated alignment, refusal steering, large language models, chain-of-thought

The pith

Reasoning models regain safety by projecting onto their own safety-filtered distribution as the unique KL-optimal target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models often lose safety after RL training for long chain-of-thought reasoning, because the optimization favors compliance over refusing harmful requests. The paper formalizes safety realignment as a KL projection onto the safe simplex and proves that the model's own safety-filtered distribution is the unique KL-optimal target, whereas any external teacher incurs an irreducible excess KL penalty. ThinkSafe implements this insight through lightweight refusal steering that unlocks the model's retained ability to detect harm while leaving its reasoning distribution intact. Experiments show that the method raises safety on DeepSeek-R1-Distill and Qwen3 models while preserving reasoning performance, and that it reaches better safety and comparable reasoning to GRPO with roughly an order of magnitude less compute.

Core claim

Safety realignment is formalized as a KL projection onto the safe simplex. The student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty. ThinkSafe restores safety by lightweight refusal steering that preserves this optimal target and raises the acceptance rate of safe outputs.
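The core claim can be checked numerically on a toy example. The sketch below is illustrative only: it assumes a discrete response space, takes the projection as the minimizer of KL(q || p) over safe distributions q, and uses made-up probabilities. The names `safety_filter`, `q_self`, and `q_teacher` are hypothetical and do not come from the paper or its code release.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as equal-length lists."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # a term with q = 0 contributes nothing
        if pi == 0.0:
            return math.inf  # q puts mass where p has none
        total += qi * math.log(qi / pi)
    return total

def safety_filter(p, safe):
    """Zero out unsafe outcomes and renormalize the remaining mass."""
    safe_mass = sum(pi for pi, s in zip(p, safe) if s)
    return [pi / safe_mass if s else 0.0 for pi, s in zip(p, safe)]

# Toy student distribution over four responses; the last two are unsafe.
p = [0.4, 0.1, 0.3, 0.2]
safe = [True, True, False, False]

q_self = safety_filter(p, safe)       # student's own safety-filtered distribution
q_teacher = [0.5, 0.5, 0.0, 0.0]      # hypothetical external teacher, also fully safe

safe_mass = sum(pi for pi, s in zip(p, safe) if s)
floor = -math.log(safe_mass)          # smallest KL(q || p) reachable by any safe q

kl_self = kl(q_self, p)               # sits exactly on the floor
kl_teacher = kl(q_teacher, p)
excess = kl_teacher - kl_self         # the teacher's extra penalty
```

In this toy setting the excess equals KL(q_teacher || q_self), so it vanishes only when the teacher coincides with the student's filtered distribution.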

What carries the argument

KL projection onto the safe simplex using the model's own safety-filtered distribution as the target.

If this is right

ThinkSafe improves safety while preserving reasoning proficiency on DeepSeek-R1-Distill and Qwen3.
It achieves superior safety and comparable reasoning to GRPO with roughly an order of magnitude less compute.
The self-generated target avoids the distributional discrepancy introduced by external teachers.
Lightweight refusal steering increases the rate of safe refusals while keeping the KL-optimal target unchanged.
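The last point can be illustrated with a small simulation. This is a sketch under strong assumptions, not the ThinkSafe pipeline: responses are three hypothetical labels, the safety filter is a set lookup, and "steering" is modeled as an idealized shift of probability mass toward safe responses that keeps their relative proportions fixed. All numbers are invented for illustration.

```python
import random
from collections import Counter

RESPONSES = ["refuse_a", "refuse_b", "comply_harmful"]  # hypothetical labels
SAFE = {"refuse_a", "refuse_b"}

def sample_safe(weights, n, rng):
    """Draw n responses, keep the safe ones, and return
    (acceptance rate, empirical distribution over safe responses)."""
    draws = rng.choices(RESPONSES, weights=weights, k=n)
    kept = [r for r in draws if r in SAFE]
    counts = Counter(kept)
    dist = {r: counts[r] / len(kept) for r in sorted(SAFE)}
    return len(kept) / n, dist

rng = random.Random(0)

# Unsteered model (made-up numbers): 10% safe mass, split 3:1 between refusals.
unsteered = [0.075, 0.025, 0.90]
# Idealized steering (assumption): safe mass rises to 80%, same 3:1 split.
steered = [0.60, 0.20, 0.20]

rate_u, dist_u = sample_safe(unsteered, 200_000, rng)
rate_s, dist_s = sample_safe(steered, 200_000, rng)
```

Under these assumptions the accepted samples follow the same distribution in both runs, while the steered run wastes far fewer draws. Whether a real steering prompt leaves the safe conditionals untouched is an empirical question that this toy model does not answer.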

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-projection approach could be tested on other suppressed behaviors such as factual consistency after heavy RL.
Periodic self-alignment steps might be inserted into training pipelines to maintain safety alongside reasoning gains.
The result implies that compliance training hides rather than erases safety knowledge, so minimal steering suffices for recovery.

Load-bearing premise

Models retain latent knowledge to identify harm after compliance-focused optimization, which can be reliably unlocked via lightweight refusal steering without degrading native reasoning.

What would settle it

An experiment that measures the KL divergence to the safe simplex when using an external teacher versus the model's own safety-filtered outputs, or that shows refusal steering fails to produce additional safe refusals on a held-out set of harmful prompts.

Original abstract

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty. Guided by this analysis, we propose ThinkSafe, a self-generated alignment framework that restores safety without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, which preserves the KL-optimal target while increasing the acceptance rate. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency, and achieves superior safety and comparable reasoning to GRPO with roughly an order of magnitude less compute. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe and https://huggingface.co/Seanie-lee/collections.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

ThinkSafe's self-alignment via refusal steering is practically useful but its KL-optimality proof does not cover the actual method deployed.

The main thing to know is that this paper formalizes safety realignment for reasoning models as a KL projection onto the safe simplex and claims the student's own filtered distribution is the unique optimum, beating any external teacher by an irreducible gap. They then implement ThinkSafe with lightweight refusal steering to surface latent safety knowledge while keeping reasoning intact, and report gains on DeepSeek-R1-Distill and Qwen3 with far less compute than GRPO baselines. Code and models are released, which helps.

Referee Report

2 major / 1 minor

Summary. The manuscript formalizes safety realignment for large reasoning models as a KL projection onto the safe simplex and proves that the student's own safety-filtered distribution is the unique KL-optimal target, while external teachers incur an irreducible excess KL penalty. Guided by this, it introduces ThinkSafe, a self-generated alignment method using lightweight refusal steering to unlock latent safety knowledge without external distillation. Experiments on DeepSeek-R1-Distill and Qwen3 models claim significant safety gains while preserving reasoning proficiency, outperforming GRPO with roughly an order of magnitude less compute.

Significance. If the theoretical result holds and the method implements the claimed KL-optimal target without altering safe conditionals, the work would offer a principled, compute-efficient alternative to teacher-based distillation for reasoning models, reducing distributional shift while maintaining native capabilities. The open release of code, models, and datasets supports reproducibility and follow-up work.

major comments (2)

[Abstract and theory section] The proof asserts that the safety-filtered distribution is the unique minimizer of KL(q || p) over q in the safe simplex S. This requires that the filtering operation (and subsequent lightweight refusal steering) leaves conditional distributions over safe tokens unchanged, only zeroing unsafe mass and renormalizing. The deployed ThinkSafe method modifies CoT patterns and refusal phrasing, which alters safe-path probabilities; therefore the implemented q_self lies outside the exact projection and the uniqueness/excess-penalty result does not transfer to the method as described.
[Experiments section] The abstract claims superiority in safety and comparable reasoning to GRPO with an order of magnitude less compute, yet no specific metrics, baselines, error bars, or statistical tests are provided in the summary. Exact numbers, ablation results on the steering strength, and direct comparison tables are needed to substantiate the empirical claims.

minor comments (1)

[Abstract] The phrase 'roughly an order of magnitude less compute' should specify the exact compute metric (e.g., FLOPs, GPU-hours, or tokens) used for the GRPO comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important distinctions between the theoretical ideal and practical implementation, as well as the need for more detailed empirical reporting. We address each point below and will revise the manuscript to strengthen the connection between theory and method while providing fuller experimental details.

Referee: [Abstract and theory section] The proof asserts that the safety-filtered distribution is the unique minimizer of KL(q || p) over q in the safe simplex S. This requires that the filtering operation (and subsequent lightweight refusal steering) leaves conditional distributions over safe tokens unchanged, only zeroing unsafe mass and renormalizing. The deployed ThinkSafe method modifies CoT patterns and refusal phrasing, which alters safe-path probabilities; therefore the implemented q_self lies outside the exact projection and the uniqueness/excess-penalty result does not transfer to the method as described.

Authors: We appreciate this precise observation on the gap between the idealized projection and the implemented method. The core theorem establishes the safety-filtered distribution as the unique KL minimizer under the assumption of unchanged safe conditionals. ThinkSafe's lightweight refusal steering is intended as a practical approximation that primarily suppresses unsafe trajectories while retaining the model's native safe reasoning distributions to the greatest extent possible. We agree that modifications to CoT patterns and refusal phrasing introduce some deviation from the exact projection. In the revised version, we will add an explicit discussion of this approximation in the theory section, including empirical measurements of the deviation (e.g., KL divergence between steered and unsteered safe paths) and a statement that the method targets a close approximation to the KL-optimal distribution rather than claiming exact equivalence. Claims in the abstract and introduction will be adjusted accordingly to reflect this nuance.
Revision: partial
Referee: [Experiments section] The abstract claims superiority in safety and comparable reasoning to GRPO with an order of magnitude less compute, yet no specific metrics, baselines, error bars, or statistical tests are provided in the summary. Exact numbers, ablation results on the steering strength, and direct comparison tables are needed to substantiate the empirical claims.

Authors: The full manuscript contains detailed experimental results, including safety win rates, reasoning benchmarks (MATH, GSM8K, etc.), direct comparison tables versus GRPO and other baselines, ablation studies varying steering strength, and results averaged over multiple seeds with standard deviations. To address the concern, we will expand the experiments section in the revision to prominently feature all quantitative metrics, full tables with error bars, ablation results on steering hyperparameters, and statistical significance tests (e.g., paired t-tests) where appropriate. The abstract will also be updated to reference key numbers supporting the claimed improvements and compute savings.
Revision: yes
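The deviation measurement mentioned in the first response (KL divergence between steered and unsteered safe paths) could be estimated from samples roughly as follows. This is a minimal sketch and not the authors' procedure: it assumes safe responses are sampled from the steered model, that both models can score the same responses with sequence log-probabilities, and that each model's acceptance rate under the safety filter has been estimated separately. The function name `safe_path_kl` and all numbers are hypothetical.

```python
import math

def safe_path_kl(logp_steered, logp_unsteered, accept_steered, accept_unsteered):
    """Monte Carlo estimate of KL(steered_safe || unsteered_safe).

    logp_steered[i], logp_unsteered[i]: log-probability of the i-th safe
    response under each model. The responses must be drawn from the steered
    model and must have passed the safety filter.
    accept_*: estimated probability that a sample from each model is safe.
    """
    if not logp_steered or len(logp_steered) != len(logp_unsteered):
        raise ValueError("need two non-empty lists of equal length")
    if not (0.0 < accept_steered <= 1.0 and 0.0 < accept_unsteered <= 1.0):
        raise ValueError("acceptance rates must lie in (0, 1]")
    gaps = [a - b for a, b in zip(logp_steered, logp_unsteered)]
    mean_gap = sum(gaps) / len(gaps)
    # Conditioning on safety divides each model's probabilities by its own
    # safe mass, which shifts the average log-ratio by a constant.
    return mean_gap - math.log(accept_steered) + math.log(accept_unsteered)

# Toy check with made-up numbers: steering raises the safe mass from 0.1 to
# 0.8 but keeps the 3:1 ratio between two refusals, so the safe conditionals
# coincide and the estimate is 0 up to floating-point error.
steered_lp = [math.log(0.6), math.log(0.6), math.log(0.2)]
unsteered_lp = [math.log(0.075), math.log(0.075), math.log(0.025)]
estimate = safe_path_kl(steered_lp, unsteered_lp, 0.8, 0.1)
```

A value near zero would support the claim that steering preserves the target; a clearly positive value would quantify the approximation gap the referee describes.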

Circularity Check

1 step flagged

KL-optimality of self-filtered distribution follows by definition of the projection

specific steps

self-definitional [Abstract]
"We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty."

The safe simplex S is the set of distributions with zero mass on harmful outputs. The KL projection is defined as argmin_{q in S} KL(q || p); the opposite direction, KL(p || q), is infinite for every q in S whenever p places mass on a harmful output. The safety-filtered distribution is exactly the renormalization of p restricted to safe tokens (zeroing unsafe mass and rescaling), which by construction is the unique element of S achieving the minimum. The uniqueness statement and excess KL for any external q_T are therefore tautological consequences of the definition, not an independent result.
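The argument behind this flag fits in one decomposition. The derivation below is a sketch under stated assumptions: the projection is taken in the direction KL(q || p), the student p places positive mass on the set of safe outputs, and the symbols \mathcal{Y}_{\mathrm{safe}} (safe outputs) and p_S (the renormalized restriction of p) are notation introduced here rather than taken from the paper.

```latex
% p_S(y) = p(y) / p(\mathcal{Y}_{\mathrm{safe}}) for safe y, and 0 otherwise.
% For any q in the safe simplex S (q vanishes outside \mathcal{Y}_{\mathrm{safe}}):
\begin{align*}
\mathrm{KL}(q \,\|\, p)
  &= \sum_{y \in \mathcal{Y}_{\mathrm{safe}}} q(y) \log \frac{q(y)}{p(y)} \\
  &= \sum_{y \in \mathcal{Y}_{\mathrm{safe}}} q(y)
     \log \frac{q(y)}{p_S(y)\, p(\mathcal{Y}_{\mathrm{safe}})} \\
  &= \mathrm{KL}(q \,\|\, p_S) - \log p(\mathcal{Y}_{\mathrm{safe}}).
\end{align*}
```

The second term does not depend on q, and KL(q || p_S) is nonnegative with equality only at q = p_S. The minimizer is therefore p_S, and any other safe distribution q_T, including an external teacher, pays the excess KL(q_T || p_S).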

full rationale

The paper's central theoretical contribution formalizes safety realignment as a KL projection onto the safe simplex and asserts that the student's safety-filtered distribution is the unique minimizer. This uniqueness and the claimed excess penalty for external teachers are immediate consequences of how the projection is defined (renormalization of p onto safe support), making the result self-definitional rather than independently derived. The implementation via refusal steering is presented as preserving the target, but the optimality claim itself reduces to the setup. No other circular patterns (self-citation load-bearing, fitted predictions, or ansatz smuggling) appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that latent harm-identification knowledge survives compliance optimization and can be accessed via steering; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Safety realignment can be formalized as a KL projection onto the safe simplex.
This formalization underpins the uniqueness proof and is invoked to compare self-generated versus external targets.

pith-pipeline@v0.9.0 · 5566 in / 1273 out tokens · 45468 ms · 2026-05-16T09:30:42.090190+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target"

IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "lightweight refusal steering, which preserves the KL-optimal target while increasing the acceptance rate"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
cs.LG · 2026-05 · conditional · novelty 6.0

On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.

It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs
cs.LG · 2026-05 · unverdicted · novelty 4.0

SELFCI uses complementary self-distillation with two reverse KL divergences to align LLMs to contextual integrity while preserving utility, outperforming RL baselines like GRPO in agentic settings.