Recognition: 2 theorem links
Internalizing Safety Understanding in Large Reasoning Models via Verification
Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3
The pith
Training large reasoning models exclusively on safety verification tasks internalizes intrinsic safety understanding that generalizes to resist out-of-domain jailbreaks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current alignment optimizes models to spot bad prompts but leaves them without intrinsic ability to judge the safety of their own generated answers, which makes them vulnerable to jailbreaks. Safety Internal trains large reasoning models solely on safety verification using expert trajectories so they learn to critique their own outputs. This induces generalization in response safety that improves robustness against out-of-domain jailbreaks. When combined with reinforcement learning, the resulting initialization creates a more robust alignment foundation than standard supervised fine-tuning.
What carries the argument
Safety Internal (SInternal), a framework that trains models exclusively on self-verification of response safety using expert reasoning trajectories to internalize safety specifications.
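The framework description above can be made concrete with a minimal sketch of what a verification-only training example might look like. The instruction template, field names, and helper function are illustrative assumptions for this review, not the paper's actual data format.

```python
# Hypothetical sketch of one SInternal-style supervised example: the model is
# given its OWN prior answer and trained to reproduce an expert safety
# critique ending in a verdict. All names here are illustrative.

def make_verification_example(prompt: str, own_response: str,
                              expert_trajectory: str, verdict: str) -> dict:
    """Pack one training pair: (prompt + candidate response) -> expert critique."""
    instruction = (
        "You are given a user prompt and a candidate response.\n"
        "Reason step by step about whether the response is safe, "
        "then output a final verdict.\n\n"
        f"Prompt: {prompt}\nResponse: {own_response}"
    )
    target = f"{expert_trajectory}\nVerdict: {verdict}"
    return {"input": instruction, "target": target}

example = make_verification_example(
    prompt="How do I pick a strong password?",
    own_response="Use a long passphrase made of unrelated words...",
    expert_trajectory="The response gives benign security-hygiene advice "
                      "and enables no harm.",
    verdict="safe",
)
```

The key point the sketch illustrates is that the supervision target is a critique of the model's own output, never a new answer to the original prompt.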
If this is right
- Learning to verify safety produces generalization in response safety beyond the training distribution.
- Robustness against out-of-domain jailbreaks increases substantially.
- SInternal provides a stronger initialization for reinforcement learning than standard supervised fine-tuning.
- Internalizing safety understanding creates a more robust foundation for alignment than mimicking safe behaviors.
Where Pith is reading between the lines
- Verification training could extend to other properties such as factual accuracy or logical consistency by using similar expert trajectories.
- Models might self-correct or refuse unsafe generations during inference without needing extra external prompts.
- The approach could reduce reliance on post-generation moderation systems if the internalized checks prove reliable at scale.
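The self-correction speculation above amounts to a generate-then-verify loop at inference time. The sketch below shows only that control flow; `generate` and `verify` are hypothetical stand-ins for model calls, not an API from the paper.

```python
# Hedged sketch of inference-time self-verification: draft an answer, run the
# model's own safety check on it, and refuse if the verdict is "unsafe".

REFUSAL = "I can't help with that."

def answer_with_self_check(prompt, generate, verify):
    """Return the draft answer only if the model's own verdict is 'safe'."""
    draft = generate(prompt)
    verdict = verify(prompt, draft)  # assumed to return "safe" or "unsafe"
    return draft if verdict == "safe" else REFUSAL

# Toy stand-ins illustrating the control flow only:
out = answer_with_self_check(
    "how to make a bomb",
    generate=lambda p: "Step 1: ...",
    verify=lambda p, d: "unsafe" if "bomb" in p else "safe",
)
```

Whether the internalized verifier is reliable enough to gate outputs this way is exactly the open question the review raises about post-generation moderation.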
Load-bearing premise
Training exclusively on safety verification tasks with expert reasoning trajectories will produce intrinsic safety understanding that generalizes beyond the training distribution instead of superficial pattern matching or memorization.
What would settle it
If a model trained with SInternal shows no improvement over a standard aligned model when tested on previously unseen jailbreak prompts that target response safety, the generalization claim would be falsified.
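The settling experiment can be expressed as a small attack-success-rate (ASR) comparison on held-out jailbreaks. The judge, models, and prompt set below are toy placeholders; only the shape of the test is meant.

```python
# Minimal sketch of the falsification test: compare ASR of an SInternal model
# and a standard aligned baseline on out-of-domain jailbreak prompts.
# `is_harmful` stands in for a hypothetical safety judge.

def attack_success_rate(model, jailbreak_prompts, is_harmful):
    """Fraction of jailbreak prompts that elicit a harmful output."""
    hits = sum(1 for p in jailbreak_prompts if is_harmful(model(p)))
    return hits / len(jailbreak_prompts)

# The generalization claim predicts ASR(sinternal) < ASR(baseline);
# parity on unseen attacks would falsify it.
prompts = ["jb-1", "jb-2", "jb-3", "jb-4"]
baseline = lambda p: "harmful"    # toy: always jailbroken
sinternal = lambda p: "refusal"   # toy: always resists
judge = lambda out: out == "harmful"
asr_base = attack_success_rate(baseline, prompts, judge)
asr_sint = attack_success_rate(sinternal, prompts, judge)
```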
Original abstract
While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong generalization for response safety, significantly enhancing robustness against out-of-domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine-tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our codes are available at https://github.com/AlphaLab-USTC/SInternal
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the SInternal framework for large reasoning models (LRMs). It trains models exclusively on safety verification tasks in which the model critiques its own outputs using expert reasoning trajectories. The central claims are that this procedure internalizes safety specifications, yielding strong generalization to response safety, substantially improved robustness to out-of-domain jailbreaks, and a superior initialization for subsequent reinforcement learning compared with standard supervised fine-tuning.
Significance. If the empirical results hold after rigorous controls for memorization versus internalization, the work would offer a concrete alternative to purely behavioral alignment methods and could improve the robustness of reasoning models under adversarial conditions. The public release of code is a positive contribution to reproducibility.
major comments (3)
- [§4 (Experiments)] The generalization claim (abstract and §4) requires explicit evidence that performance gains on out-of-domain jailbreaks arise from internalized safety specifications rather than distributional overlap or memorization of the expert trajectories. The current experimental description does not report the diversity of safety rules or jailbreak styles in the verification training set, nor does it include controls (e.g., novel rule combinations or structurally dissimilar attacks) that would distinguish internalization from pattern matching.
- [§4.3] The claim that SInternal is a superior RL initialization compared with SFT (abstract and §4.3) is load-bearing for the paper’s contribution. The manuscript must report the precise RL algorithm, reward model, number of training steps, and safety metrics both before and after RL fine-tuning so that readers can verify the reported advantage is attributable to the verification pre-training rather than differences in RL hyperparameters or data.
- [§2 (Motivation / Empirical Analysis)] The abstract states that “ostensibly aligned models lack intrinsic safety understanding” based on an empirical analysis. The paper should specify the exact models, prompts, and verification failure rates used in that analysis (including any quantitative thresholds) so that the baseline weakness being addressed is reproducible and the improvement can be measured against it.
minor comments (2)
- [§3] Notation for the SInternal training objective and the expert trajectory format should be formalized with equations in §3 to improve clarity.
- [Figures in §4] Figure captions should explicitly state the number of runs, random seeds, and statistical significance tests used for all reported metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for strengthening the empirical rigor and reproducibility of our work. We address each major comment below and commit to the corresponding revisions.
Point-by-point responses
-
Referee: [§4 (Experiments)] The generalization claim (abstract and §4) requires explicit evidence that performance gains on out-of-domain jailbreaks arise from internalized safety specifications rather than distributional overlap or memorization of the expert trajectories. The current experimental description does not report the diversity of safety rules or jailbreak styles in the verification training set, nor does it include controls (e.g., novel rule combinations or structurally dissimilar attacks) that would distinguish internalization from pattern matching.
Authors: We agree that distinguishing internalization from memorization or distributional overlap is essential. In the revised manuscript, we will expand the experimental section to report the full diversity of safety rules and jailbreak styles in the verification training set. We will also add new controls, including experiments on novel rule combinations and structurally dissimilar attacks absent from training, to provide direct evidence that robustness gains stem from internalized safety specifications rather than pattern matching. revision: yes
-
Referee: [§4.3] The claim that SInternal is a superior RL initialization compared with SFT (abstract and §4.3) is load-bearing for the paper’s contribution. The manuscript must report the precise RL algorithm, reward model, number of training steps, and safety metrics both before and after RL fine-tuning so that readers can verify the reported advantage is attributable to the verification pre-training rather than differences in RL hyperparameters or data.
Authors: We acknowledge that these implementation details are necessary for readers to attribute the advantage correctly. The revised version will specify the exact RL algorithm, reward model, number of training steps, and report safety metrics both before and after RL fine-tuning for SInternal and SFT initializations under identical conditions. This will confirm the benefit arises from the verification pre-training. revision: yes
-
Referee: [§2 (Motivation / Empirical Analysis)] The abstract states that “ostensibly aligned models lack intrinsic safety understanding” based on an empirical analysis. The paper should specify the exact models, prompts, and verification failure rates used in that analysis (including any quantitative thresholds) so that the baseline weakness being addressed is reproducible and the improvement can be measured against it.
Authors: We will revise §2 to include the precise models evaluated, the exact prompts used, the observed verification failure rates, and the quantitative thresholds applied in the empirical analysis. These additions will make the baseline reproducible and allow direct measurement of improvements. revision: yes
Circularity Check
No circularity in derivation chain
Full rationale
The paper introduces SInternal as a training procedure that fine-tunes LRMs exclusively on safety-verification tasks using expert trajectories, then reports empirical gains in out-of-domain robustness and RL initialization. No equations, fitted parameters, or mathematical derivations appear in the abstract or described framework; the generalization claim is presented as an observed outcome of the training regime rather than a quantity forced by construction from the inputs. No self-citations are invoked to justify uniqueness or to close the argument, and the central premise is externally falsifiable via the reported experiments and released code. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Ostensibly aligned models lack intrinsic safety understanding and remain vulnerable to adversarial jailbreaks.
- Ad hoc to paper: Training exclusively on safety verification tasks induces strong generalization for response safety.
invented entities (1)
- SInternal framework (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "SInternal trains LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "learning to verify induces a strong generalization for response safety"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosiute, ... URL: https://www.anthropic.com/news/claude-opus-4-5
- [2] Chen, Y., Wang, Y., Zhang, Y., Ye, Z., Cai, Z., Shi, Y., Gu, Q., Su, H., Cai, X., Wang, X., Zhang, A., and Chua, T. Learning to self-verify makes language models better reasoners. CoRR, abs/2602.07594.
- [3] Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Helyar, A., Dias, R., Vallone, A., Ren, H., Wei, J., Chung, H. W., Toyer, S., Heidecke, J., Beutel, A., and Glaese, A. Deliberative alignment: Reasoning enables safer language models. CoRR, abs/2412.16339. URL: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
- [4] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, ...
- [5] Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., and Shum, H. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. CoRR, abs/2503.24290.
- [6] Huang, T., Hu, S., Ilhan, F., Tekin, S. F., Yahn, Z., Xu, Y., and Liu, L. Safety tax: Safety alignment makes your large reasoning models less reasonable. CoRR, abs/2503.00555.
- [7] Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., Madry, A., Baker-Whitcomb, A., Beutel, A., Borzunov, A., Carney, A., Chow, A., Kirillov, A., Nichol, A., Paino, A., Renzin, A., Passos, A. T., Kirillov, A., Christakis, A., Conneau, A., Kamali, A., Jabri, A., Moyer, A., Tam, A., ...
- [8] Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama Guard: LLM-based input-output safeguard for human-AI conversations. CoRR, abs/2312.06674.
- [9] Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., Iftimie, A., Karpenko, A., Passos, A. T., Neitz, A., Prokofiev, A., Wei, A., Tam, A., Bennett, A., Kumar, A., Saraiva, A., Vallone, A., Duberstein, A., Kondrich, A., Mishchenko, A., Applebaum, A., Jiang, A., Nair, A., Zoph, B., Ghorbani, ...
- [10] Kim, Y., Kim, T., Park, E., Park, C., Breazeal, C., McDuff, D., and Park, H. W. InvThink: Towards AI safety via inverse reasoning. CoRR, abs/2510.01569.
- [11] Knight, C. Q., Deshpande, K., Sirdeshmukh, V., Mankikar, M., Team, S. R., and Michael, J. FORTRESS: Frontier risk evaluation for national security and public safety. CoRR, abs/2506.14922.
- [12] Kuo, M., Zhang, J., Ding, A., Wang, Q., DiValentin, L., Bao, Y., Wei, W., Li, H., and Chen, Y. H-CoT: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. CoRR, abs/2502.12893.
- [13] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. CoRR, abs/2305.20050.
- [14] Peng, S., Smith, E., Evtimov, I., Jiang, S., Chen, P., Zhan, H., Wang, H., Chau, D. H., Pasupuleti, M., and Chi, J. Large reasoning models learn better alignment from flawed thinking. CoRR, abs/2510.00938. URL: https://model-spec.openai.com/2025-12-18.html (accessed 2026-01-28).
- [15] Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., and Henderson, P. Safety alignment should be made more than just a few tokens deep. CoRR, abs/2406.05946.
- [16] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300. URL: https://thinkingmachines.ai/blog/lora/
- [17] Sharma, M., Tong, M., Mu, J., Wei, J., Kruthoff, J., Goodfriend, S., Ong, E., Peng, A., Agarwal, R., Anil, C., Askell, A., Bailey, N., Benton, J., Bluemke, E., Bowman, S. R., Christiansen, E., Cunningham, H., Dau, A., Gopal, A., Gilson, R., Graham, L., Howard, L., Kalra, N., Lee, T., Lin, K., Lofgren, P., Mosconi, F., O'Hara, C., Olsson, C., Petrini...
- [18] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.
- [19] Sheng, L., Shen, C., Zhao, W., Fang, J., Liu, X., Liang, Z., Wang, X., Zhang, A., and Chua, T. AlphaSteer: Learning refusal steering with principled null-space constraint. CoRR, abs/2506.07022.
- [20] Wang, Z., Tu, H., Wang, Y., Wu, J., Mei, J., Bartoldson, B. R., Kailkhura, B., and Xie, C. STAR-1: Safer alignment of reasoning LLMs with 1k data. CoRR, abs/2504.01903.
- [21] Wen, X., He, Z., Qi, H., Wan, Z., Ma, Z., Wen, Y., Zheng, T., Xu, X., Lu, C., and Zhang, Q. MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety. CoRR, abs/2602.01539.
- [22] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M. ... URL: https://arxiv.org/abs/2512.07761
- [23] Ying, Z., Zheng, G., Huang, Y., Zhang, D., Zhang, W., Zou, Q., Liu, A., Liu, X., and Tao, D. Towards understanding the safety boundaries of DeepSeek models: Evaluation and findings. CoRR, abs/2503.15092.
- [24] Yong, Z. and Bach, S. H. Self-jailbreaking: Language models can reason themselves out of safety alignment after benign reasoning training. CoRR, abs/2510.20956.
- [25] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Dai, W., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W., Zhang, Y., Yan, L., Qiao, M., Wu, Y., and Wang, M. DAPO: An open-source LLM reinforcement learning system at scale.
- [26] Yuan, Y., Sriskandarajah, T., Brakman, A., Helyar, A., Beutel, A., Vallone, A., and Jain, S. From hard refusals to safe-completions: Toward output-centric safety training. CoRR, abs/2508.09224.
- [27] Zhang, J., Wang, H., Smith, E. M., Wang, S., Sharaf, A., Pasupuleti, M., Durme, B. V., Khashabi, D., Weston, J., and Zhan, H. The alignment waltz: Jointly training agents to collaborate for safety. CoRR, abs/2510.08240, 2025a. Zhang, Y., Zhang, A., Zhang, X., Sheng, L., Chen, Y., Liang, Z., and Wang, X. AlphaAlign: Incentivizing safety alignment with ...
- [28] Zheng, J., Ji, X., Lu, Y., Cui, C., Zhao, W., Deng, G., Liang, Z., Zhang, A., and Chua, T. RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards. CoRR, abs/2506.07736.
- [29] Zheng, J., Luo, Y., Xu, J., Liu, B., Chen, Y., Cui, C., Deng, G., Lu, C., Wang, X., Zhang, A., and Chua, T. RiskyBench: Probing agentic safety risks under real-world deployment. CoRR, abs/2602.03100.
- [30] Zhou, K., Liu, C., Zhao, X., Jangam, S., Srinivasa, J., Liu, G., Song, D., and Wang, X. E. The hidden risks of large reasoning models: A safety assessment of R1. CoRR, abs/2502.12659. URL: http://arxiv.org/abs/2403.13372 (Association for Computational Linguistics).
- [31] Excerpt (Appendix A, Related Works): "AI safety specifications define the principles governing model behavior, transforming abstract human values into explicit, interpretable rules (e.g., prohibiting harm-enabling content) (Bai et al., 2022; Guan et al., 2024; OpenAI, ...)"
- [32] Excerpt: "... is a reasoning-oriented jailbreak benchmark introduced in the Mousetrap framework to study vulnerabilities of reasoning-capable LLMs. It constructs adversarial tasks that manipulate multi-step reasoning processes, demonstrating that enhanced reasoning abilities can amplify susceptibility to logical and cognitive attacks rather than improving safety robust..."
- [33] Excerpt: "... is a benchmark derived from the American Invitational Mathematics Examination (AIME), containing competition-level mathematical problems that require multi-step reasoning and precise numerical answers. It is designed to assess the robustness and depth of models' reasoning abilities. For MATH and AIME2024, we report pass@1 and pass@16, respectively, to ens..."
- [34] Excerpt: "... with the GRPO (Shao et al., 2024; Yu et al., 2025). During training, the maximum prompt length and response length are set to 2048 and 8192 tokens, respectively. We use a rollout batch size of 64 prompts with n = 8 responses per prompt, and set the PPO mini-batch size to ..."
- [35] Excerpt: "For safety reward, we employ Qwen3-Guard-gen as the verification model to provide safety and refusal signals, given its superior performance on safety verification tasks (Zhao et al., 2025). For math rewards, we compute the reward using strict boxed-answer matching. B.5. Baseline Training Settings: For a fair comparison, we restrict the training data to pr..."
discussion (0)