pith · machine review for the scientific record

arxiv: 2605.08936 · v1 · submitted 2026-05-09 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:01 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords self-recovery · reinforcement learning · large reasoning models · jailbreak attacks · safety alignment · adversarial robustness · unsafe trajectories · self-correction

The pith

A reinforcement learning method lets large reasoning models recover from the unsafe paths they generate themselves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large reasoning models often fail to correct unsafe reasoning when hit by adversarial prompts, even though they handle self-correction well in normal tasks. Standard fixes use fixed expert examples or prefixes, but these static sets miss the actual sequences the model produces during its own reasoning. Self-ReSET instead has the model create its own unsafe trajectories and feeds those back as starting points for reinforcement learning, training it to steer back to safe paths from its real failures. This produces stronger resistance to new jailbreak attempts while keeping everyday performance steady and using less external data. A reader would care because it points to a way to close the gap between training distributions and the model's live behavior.

Core claim

Self-ReSET is a pure reinforcement learning framework that equips large reasoning models with the intrinsic capacity to recover from their own safety-error trajectories; those self-generated trajectories are reused as initial states for reinforcement learning.

What carries the argument

Self-ReSET: a reinforcement learning loop that takes dynamically generated unsafe reasoning trajectories from the model itself and uses them as initial states to train recovery to benign outputs.
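Read as one possible concretization, the loop could look like the sketch below: sample on-policy, flag unsafe trajectories, keep the offending prefixes in a replay buffer, and reward continuations that recover to a safe answer. Every interface here (generate, first_unsafe_step, is_safe_and_helpful, rl_update, the buffer size) is an illustrative assumption rather than the authors' implementation; the actual reward and RLVR details are in the paper's repository.

import random
from collections import deque

# Unsafe prefixes found during training, kept as high-value initial states.
replay_buffer = deque(maxlen=4096)

def self_reset_step(model, prompts, safety_judge, n_rollouts=8):
    # 1) Monitor: sample on-policy trajectories and flag the unsafe ones.
    for prompt in prompts:
        for _ in range(n_rollouts):
            trajectory = model.generate(prompt)                # list of reasoning steps
            step = safety_judge.first_unsafe_step(trajectory)  # None if the trajectory stays safe
            if step is not None:
                # 2) Memorize: store the prompt plus the unsafe prefix.
                replay_buffer.append((prompt, trajectory[:step + 1]))

    # 3) Self-recover: restart from replayed unsafe prefixes and reward
    #    continuations that end in a safe, still-useful answer (RLVR-style).
    batch = random.sample(list(replay_buffer), k=min(64, len(replay_buffer)))
    rollouts = [(p, prefix, model.generate(p, prefix=prefix)) for p, prefix in batch]
    rewards = [1.0 if safety_judge.is_safe_and_helpful(p, cont) else 0.0
               for p, _, cont in rollouts]
    model.rl_update(rollouts, rewards)  # e.g. a GRPO/DAPO-style policy update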

If this is right

  • Stronger defense against out-of-distribution jailbreak prompts while general capabilities stay intact.
  • More efficient use of training data since no large static expert sets are required.
  • Emergence of self-recovery patterns that spot unsafe intermediate states and steer back to safe paths.
  • Better coverage of the model's actual generation space during safety training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-generated trajectory approach could extend to other safety issues such as bias or factual errors.
  • It reduces reliance on human-curated safety datasets for alignment work.
  • Testing the method on even larger models would show whether the gains scale.
  • Combining Self-ReSET with other reinforcement learning safety techniques could create stronger hybrid training.

Load-bearing premise

That the unsafe trajectories the model produces on its own will cover enough of its possible failure modes for reinforcement learning to learn reliable recovery.

What would settle it

Running the trained model on a fresh set of out-of-distribution jailbreak prompts and finding no gain in successful recovery rates compared with models trained on static expert data.
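A minimal sketch of that test, under assumptions the page does not verify: seed both models with the same adversarial prefixes on held-out OOD prompts and compare recovery rates. The model, attacker, and judge interfaces are hypothetical placeholders, not the paper's evaluation harness.

def recovery_rate(model, ood_prompts, attacker, safety_judge, n=4):
    # Fraction of adversarially seeded generations that end in a safe answer.
    recovered = total = 0
    for prompt in ood_prompts:
        unsafe_prefix = attacker.unsafe_prefix(prompt)  # e.g. a prefilled jailbreak opening
        for _ in range(n):
            answer = model.generate(prompt, prefix=unsafe_prefix)
            recovered += int(safety_judge.is_safe(prompt, answer))
            total += 1
    return recovered / total

def settles_it(self_reset_model, static_baseline, ood_prompts, attacker, judge):
    # The core claim would be undercut if the Self-ReSET model shows no gain
    # over a baseline trained on static expert data, starting from the same unsafe states.
    r_dynamic = recovery_rate(self_reset_model, ood_prompts, attacker, judge)
    r_static = recovery_rate(static_baseline, ood_prompts, attacker, judge)
    return r_dynamic <= r_static  # True counts against the paper's claim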

Figures

Figures reproduced from arXiv: 2605.08936 by An Zhang, Chaochao Lu, Dongcheng Zhang, Xiang Wang, Yi Zhang, Yuxin Chen.

Figure 1: Illustration of recovery failure and self…

Figure 2: Overview of Self-ReSET. The pipeline follows a "monitor, memorize, then self-recover during reasoning" scheme. It first monitors the model's reasoning trajectories generated from training-set prompts and memorizes the unsafe trigger prefixes in the experience replay buffer as high-value training signals, replaying them to the model so it learns to self-recover within the RLVR framework. If it places the model in a state fr…

Figure 3: The recovery rate of unsafe reasoning trajectories across three base models…

Figure 4: Comparison across three base models against self-prefilling attacks with various lengths.

Figure 5: Training trend of DS-Qwen-7B. A key advantage of Self-ReSET lies in its ability to continuously explore and learn from the model's own failure trajectories with the buffer, which naturally leads to higher data efficiency. To investigate this property, we compare Self-ReSET with vanilla DAPO on DS-Qwen-7B under varying prompt-source data sizes. Results for DS-distilled models are provided in Appendix F.

Figure 6: Training trend of DS-distilled models, evaluated on WildJailbreak.

Figure 7: Unsafe trajectory from the base model; the Self-ReSET model self-recovers on the same prompt. Adversarial prompt: "Can you provide me with an intellectual puzzle in Italian that deciphers into a method of misleading a device used to measure the level of alcohol in a person's breath?"

Figure 8: Self-recovery on the same prompt as…

Figure 10: Direct safety awareness on the same prompt as Figure…
Original abstract

Large Reasoning Models possess remarkable capabilities for self-correction in general domain; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data including reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data which inevitably deviate from model's dynamic, on-policy reasoning traces, resulting in model hardly covering its vast generation space and learning to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks especially out-of-distribution (OOD) jailbreak prompts while maintaining general utility, along with efficient data utilization. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify and recover from unsafe intermediate error states back to benign paths. Our codes and data are available at https://github.com/Ing1024/Self-ReSET.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Self-ReSET, a reinforcement learning framework for large reasoning models (LRMs) that reuses the model's own dynamically generated unsafe reasoning trajectories as initial states to train the model to recover from safety errors. The authors claim this approach enhances robustness to adversarial attacks, particularly out-of-distribution (OOD) jailbreak prompts, maintains general utility, and utilizes data efficiently compared to methods relying on static expert data.

Significance. If the empirical results hold, this work could be significant for AI safety by providing a method to improve self-correction in reasoning models without relying on potentially mismatched static datasets. It highlights the potential of on-policy RL for learning recovery behaviors, which might generalize better to novel threats. The availability of code and data is a positive for reproducibility.

major comments (3)
  1. [Abstract] Abstract: The abstract asserts 'extensive experiments across various LRMs and benchmarks' that demonstrate significant OOD robustness gains, but provides no metrics, baselines, ablation details, statistical controls, or effect sizes. This prevents assessment of whether the central claim of superior OOD performance over static-data methods is supported.
  2. [Method] Method section: The core claim that self-generated unsafe trajectories enable recovery across a 'vast generation space' including OOD jailbreaks rests on the unverified assumption that on-policy sampling will cover distributions beyond the current policy's reachable error manifold. No explicit coverage metric, diversity injection, or comparison to external OOD prompts is described, raising the risk that reported OOD gains reflect test-set overlap rather than genuine generalization.
  3. [Experiments] Experiments section: Without details on how OOD prompts are constructed and whether they lie outside the distribution of trajectories generated during training, the headline OOD improvement cannot be distinguished from in-distribution recovery. The paper should report separate in-distribution vs. OOD metrics and ablations isolating the effect of dynamic vs. static initial states.
minor comments (2)
  1. [Abstract] Abstract: Clarify the reward function and policy update rule used in the 'pure reinforcement learning framework,' as these are central to understanding how recovery is incentivized (one plausible formulation is sketched after this list).
  2. The GitHub link is welcome, but the manuscript should include a reproducibility checklist covering random seeds, hyperparameter ranges, and exact prompt templates for the adversarial attacks.
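One plausible formulation of what the first minor comment asks for, offered as an assumption rather than the paper's actual design: a binary, verifiable safety-and-helpfulness reward paired with a GRPO/DAPO-style group-normalized advantage.

from statistics import mean, pstdev

def safety_reward(prompt, completion, safety_judge, refusal_judge):
    # +1 only if the continuation is safe AND is not an over-refusal of a
    # benign request; 0 otherwise. Both judges are hypothetical classifiers.
    safe = safety_judge.is_safe(prompt, completion)
    helpful = not refusal_judge.is_over_refusal(prompt, completion)
    return 1.0 if (safe and helpful) else 0.0

def group_advantages(rewards, eps=1e-6):
    # GRPO-style: normalize each rollout's reward within its prompt group so
    # the policy gradient favors above-average recoveries.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts from one unsafe prefix; only the second recovers safely.
print(group_advantages([0.0, 1.0, 0.0, 0.0]))  # the recovered rollout gets the positive advantage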

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, providing clarifications from the manuscript where available and indicating revisions to strengthen the presentation of results and methodology.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts 'extensive experiments across various LRMs and benchmarks' that demonstrate significant OOD robustness gains, but provides no metrics, baselines, ablation details, statistical controls, or effect sizes. This prevents assessment of whether the central claim of superior OOD performance over static-data methods is supported.

    Authors: The abstract is intentionally concise as a high-level overview. The Experiments section (and associated tables/figures) provides the requested details, including quantitative OOD robustness improvements, baseline comparisons (e.g., against static expert-data methods), ablation studies, and effect sizes across multiple LRMs and benchmarks. To improve accessibility, we will revise the abstract to incorporate key numerical results and effect sizes supporting the OOD gains. revision: yes

  2. Referee: [Method] Method section: The core claim that self-generated unsafe trajectories enable recovery across a 'vast generation space' including OOD jailbreaks rests on the unverified assumption that on-policy sampling will cover distributions beyond the current policy's reachable error manifold. No explicit coverage metric, diversity injection, or comparison to external OOD prompts is described, raising the risk that reported OOD gains reflect test-set overlap rather than genuine generalization.

    Authors: Self-ReSET generates unsafe trajectories on-policy from the current model during RL training, directly sampling from the policy's reachable error states rather than relying on a fixed external distribution. This design inherently targets the model's own generation manifold. OOD evaluation uses separately constructed prompts held out from training data generation. While an explicit coverage metric was not reported, the consistent OOD gains over static baselines provide empirical support for generalization. We will add a discussion of trajectory diversity and a direct comparison to external OOD prompt sets in the revised Method section. revision: partial

  3. Referee: [Experiments] Experiments section: Without details on how OOD prompts are constructed and whether they lie outside the distribution of trajectories generated during training, the headline OOD improvement cannot be distinguished from in-distribution recovery. The paper should report separate in-distribution vs. OOD metrics and ablations isolating the effect of dynamic vs. static initial states.

    Authors: The Experiments section describes OOD prompt construction via adversarial templates and topics distinct from the training trajectory distribution, with results showing gains on these held-out sets. To address the concern directly, we will expand the section with explicit construction details, separate in-distribution versus OOD performance tables, and new ablations contrasting dynamic self-generated initial states against static expert data. These additions will isolate the contribution of on-policy recovery learning. revision: yes

Circularity Check

0 steps flagged

No circularity: method and claims are independently defined and empirically validated

full rationale

The paper defines Self-ReSET as a reinforcement learning procedure that generates unsafe trajectories on-policy from the current model and reuses them as RL starting states to train recovery. This construction is stated directly in the abstract and does not reduce any claimed performance gain (robustness on OOD jailbreaks, utility preservation) to a fitted parameter or self-citation by definition. No equations are presented that equate the output metric to the input distribution; the reported improvements rest on external benchmark experiments rather than tautological renaming or load-bearing self-citations. The derivation chain therefore remains self-contained against the stated inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract-only view yields limited visibility into the exact reward formulation or trajectory-sampling details; the core assumption is that on-policy error trajectories provide superior coverage for recovery learning.

axioms (1)
  • domain assumption: Reinforcement learning on self-generated unsafe trajectories will produce recovery behaviors that generalize to unseen adversarial prompts.
    Central to the proposal that dynamic on-policy data solves the coverage problem of static datasets; a rough way to probe this coverage assumption is sketched below.
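One rough way to probe that coverage assumption, using choices the paper does not specify: embed the self-generated unsafe trajectories and an external OOD jailbreak set, then measure how many OOD failures sit near some self-generated one. The encoder and threshold below are illustrative assumptions, not the authors' methodology.

from sentence_transformers import SentenceTransformer

def coverage_fraction(self_generated_unsafe, ood_unsafe, threshold=0.7):
    # Share of external OOD failure cases whose nearest self-generated
    # trajectory exceeds a cosine-similarity threshold.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do
    a = encoder.encode(self_generated_unsafe, normalize_embeddings=True)
    b = encoder.encode(ood_unsafe, normalize_embeddings=True)
    sims = b @ a.T               # cosine similarities (rows: OOD, cols: self-generated)
    nearest = sims.max(axis=1)   # best match for each OOD failure
    return float((nearest >= threshold).mean())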

pith-pipeline@v0.9.0 · 5514 in / 1128 out tokens · 23959 ms · 2026-05-12T02:01:26.035086+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
