Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi

classification cs.LG

keywords reasoning, models, safety, alignment, flawed, counter-aligned, large, learning

Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining the inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.
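
The abstract describes the training data only at a high level: a mixture of counter-aligned CoT prefills and standard prompts. The following is a minimal sketch of how such a mixture might be assembled before RL rollouts, not the authors' implementation. The names `RolloutInput`, `build_mixture`, and `render`, the `prefill_ratio` parameter, and the `<think>` delimiter are all hypothetical; the abstract specifies neither a mixing ratio nor a prompt template.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RolloutInput:
    prompt: str
    # Flawed (counter-aligned) reasoning the policy must continue from,
    # or None for a standard prompt with no prefill.
    prefill: Optional[str]


def build_mixture(
    prompts: List[str],
    flawed_cot: Dict[str, str],
    prefill_ratio: float = 0.5,  # assumed value; not given in the abstract
    seed: int = 0,
) -> List[RolloutInput]:
    """Mix prefilled and standard prompts for one training batch."""
    rng = random.Random(seed)
    batch = []
    for p in prompts:
        # Only prompts that have a synthetic flawed trace can be prefilled.
        use_prefill = p in flawed_cot and rng.random() < prefill_ratio
        batch.append(RolloutInput(p, flawed_cot[p] if use_prefill else None))
    return batch


def render(item: RolloutInput, think_open: str = "<think>") -> str:
    """Build the text the policy generates from (hypothetical template)."""
    text = item.prompt + "\n" + think_open
    if item.prefill is not None:
        # The policy continues after the injected flawed reasoning and is
        # rewarded on the final response only.
        text += item.prefill
    return text
```

Under these assumptions the reward would be computed on the completed response exactly as in a standard RLHF loop, which is consistent with the abstract's claim that no modification beyond vanilla RLHF is needed.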

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
cs.AI 2026-05 unverdicted novelty 6.0

Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.

Internalizing Safety Understanding in Large Reasoning Models via Verification
cs.AI 2026-05 unverdicted novelty 6.0

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment tha...