REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

Bo Zou; Chaochao Lu; Chao Yang; Jiachen Ma; Jiawen Zhang; Xiangtian Li

arxiv: 2605.20654 · v1 · pith:46ZH5J3Dnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 06:12 UTC · model grok-4.3

keywords jailbreak defense · self-reflection · LLM safety · reinforcement learning · supervised fine-tuning · indirect attacks · trajectory-level safety

The pith

Reflector internalizes self-reflection in LLMs to defend against indirect jailbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reflector as a two-stage method to embed step-wise reflection directly into an LLM's generation process. First it collects high-quality reflection examples from a teacher model and uses them for supervised fine-tuning; then it applies reinforcement learning with outcome and validity rewards to make the reflection autonomous. The goal is to move safety from surface-level checks to internal trajectory-level behavior so the model can resist multi-step indirect attacks. If this holds, the model would achieve strong defense rates while also improving accuracy on reasoning tasks instead of trading off capability for safety.

Core claim

Reflector is a two-stage framework that first performs teacher-guided supervised fine-tuning to establish structured reflection patterns and then applies reinforcement learning with outcome-driven and reward-validity supervision to internalize autonomous self-reflection, resulting in defense success rates above 90 percent against complex indirect jailbreaks and a 5.85 percent gain on GSM8K.

What carries the argument

The Reflector two-stage pipeline that internalizes trajectory-level safety by turning teacher-generated reflection data into autonomous, step-wise self-correction during generation.

If this is right

Defense success rates exceed 90 percent on complex indirect attacks.
The method generalizes across diverse threat scenarios without retraining.
Task performance improves, including a 5.85 percent gain on GSM8K and better results on knowledge benchmarks.
Safety is added at the trajectory level without significant computational overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same internalization approach could be tested on other alignment problems such as reducing hallucination or bias.
Training on a mixture of reflection data from multiple teacher models might increase robustness to teacher-specific biases.
Measuring whether the added reflection steps remain stable under distribution shift in user prompts would be a direct next experiment.

Load-bearing premise

High-quality reflection data from a teacher model can be internalized via reinforcement learning to produce robust autonomous self-reflection that generalizes without creating new vulnerabilities or overhead.

What would settle it

A new set of indirect jailbreak prompts never seen in training that causes defense success rate to drop below 70 percent while task performance on GSM8K remains unchanged or declines.
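
The criterion above can be written as a small check. The sketch below is illustrative only: it assumes DSR is the fraction of attack prompts the model defends against, and the function names, the `defended` flags, and the example numbers are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def defense_success_rate(defended: Sequence[bool]) -> float:
    """Fraction of attack prompts that were defended (assumed DSR definition)."""
    if not defended:
        raise ValueError("need at least one attack outcome")
    return sum(defended) / len(defended)


def claim_is_undermined(defended: Sequence[bool],
                        gsm8k_before: float,
                        gsm8k_after: float,
                        dsr_floor: float = 0.70) -> bool:
    """True when DSR on unseen attacks is below the floor and GSM8K did not improve."""
    dsr = defense_success_rate(defended)
    return dsr < dsr_floor and gsm8k_after <= gsm8k_before


# Hypothetical outcomes on 10 held-out indirect jailbreak prompts.
outcomes = [True] * 6 + [False] * 4
print(defense_success_rate(outcomes))             # 0.6
print(claim_is_undermined(outcomes, 80.0, 80.0))  # True
```

Both conditions have to hold together: a low DSR with a clear GSM8K improvement, or a high DSR with flat accuracy, would not meet the stated criterion.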

Figures

Figures reproduced from arXiv: 2605.20654 by Bo Zou, Chaochao Lu, Chao Yang, Jiachen Ma, Jiawen Zhang, Xiangtian Li.

**Figure 1.** Correlation between the position of the first occurrence of harmful tokens and the attack success rate (ASR). While direct jailbreaks (blue) manifest immediately, indirect attacks (red) exhibit a stealthy latency, with harmful content emerging only after 20 tokens. This delay enables malicious intent to bypass surface-level safety alignment, leading to significantly higher ASRs than direct attacks.
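
The quantity on the horizontal axis of Figure 1 can be illustrated with a short sketch. It assumes a per-token harmfulness flag is already available from some classifier; the helper names and the use of 20 tokens as a cut-off (taken from the caption) are illustrative choices, not the paper's measurement code.

```python
from typing import Optional, Sequence

# Tokens; the caption reports indirect attacks surfacing only after this point.
LATENCY_THRESHOLD = 20


def first_harmful_position(harm_flags: Sequence[bool]) -> Optional[int]:
    """Return the 0-based index of the first flagged token, or None if none is flagged."""
    for index, flagged in enumerate(harm_flags):
        if flagged:
            return index
    return None


def is_delayed(harm_flags: Sequence[bool], threshold: int = LATENCY_THRESHOLD) -> bool:
    """True when harmful content first appears at or beyond `threshold` tokens."""
    position = first_harmful_position(harm_flags)
    return position is not None and position >= threshold


# Hypothetical flag sequences: a direct attack is flagged at once, an indirect one late.
direct = [True] + [False] * 30
indirect = [False] * 25 + [True]
print(first_harmful_position(direct), is_delayed(direct))
print(first_harmful_position(indirect), is_delayed(indirect))
```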

**Figure 2.** The framework of REFLECTOR. In Stage 1 (SFT), the model learns the “search-and-recovery” reflection pattern from teacher-guided data. In Stage 2 (RL), the model undergoes self-improvement via GDPO, guided by a hybrid reward function that jointly optimizes for final response safety (r_safety) and the validity of the reflection process (r_reflect).

**Figure 3.** Impact of safety data scaling. (a) Increasing safety data yields initial gains but ultimately degrades general performance due to over-alignment. (b) Higher safety ratios consistently strengthen reflective defenses against overtly harmful queries.

Original abstract

While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Reflector presents a two-stage SFT-then-RL pipeline that internalizes step-wise reflection to handle indirect jailbreaks, backed by strong reported defense rates and utility gains.

Reflector gives a clear path to embedding step-wise self-reflection into LLMs so they can catch indirect jailbreaks on their own. The two-stage training (SFT from a teacher, followed by RL with outcome and validity rewards) seems to deliver both safety and some capability gains.

The paper does a good job laying out how surface-level alignment falls short against multi-step attacks that exploit the generation process. By focusing on internalizing reflection at the trajectory level, the authors move beyond prompt-based or post-hoc fixes. The empirical side shows defense success rates above 90 percent across various indirect threats, plus a 5.85 percent improvement on GSM8K and gains on knowledge benchmarks. That combination of safety and utility is what makes it stand out from many alignment papers that sacrifice one for the other.

The methods look solid. The teacher SFT creates structured reflection examples, and the RL stage adds supervision to keep the reflections honest and effective. There is no obvious circularity in the setup, and the stress test confirms the construction supports the claims without hidden flaws.

A minor soft spot is the level of detail on attack construction and statistical significance in the experiments. More explicit comparisons to recent indirect jailbreak defenses would help readers place the gains. These are not fatal but would make the case tighter.

This paper is for the LLM safety community, especially those building or evaluating deployed systems. Practitioners looking for methods that scale without heavy overhead will get the most out of it. I would recommend sending it for peer review. The core idea is practical and the results merit closer examination by referees.

Referee Report

0 major / 3 minor

Summary. The paper introduces REFLECTOR, a two-stage framework for internalizing step-wise self-reflection in LLMs to defend against indirect jailbreak attacks. The first stage uses teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT). The second stage applies reinforcement learning (RL) with outcome-driven and reward-validity supervision to enable autonomous self-reflection during generation. Empirical results claim Defense Success Rates (DSR) exceeding 90% against complex indirect attacks with robust generalization across threat scenarios, plus a 5.85% gain on GSM8K and improvements on knowledge-intensive benchmarks.

Significance. If the results hold, this work is significant for LLM safety research. It moves beyond surface-level alignment by embedding trajectory-level reflection via a practical SFT-then-RL pipeline, which could scale to other safety properties while preserving or enhancing utility as shown by the GSM8K gains. The explicit use of outcome-driven and reward-validity signals in RL is a concrete strength that supports the claim of autonomous reflection without added overhead.

minor comments (3)

[Abstract and Experiments] The abstract and experimental results section would benefit from explicit mention of the number of attack instances, attack construction protocol, and statistical significance tests supporting the DSR >90% and GSM8K claims.
[Method (RL stage)] Clarify in the methods how the reward-validity supervision is implemented to prevent reward hacking during RL; while the overall procedure is consistent, a short pseudocode or equation would improve reproducibility.
[Figures] Figure captions and legends should more clearly distinguish between different indirect attack variants and baseline defenses to aid reader interpretation of the generalization results.
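
To make the first comment concrete, one common way to report uncertainty on a rate such as DSR is a Wilson score interval. The sketch below is only an illustration of that standard formula: the counts are hypothetical (the page does not state how many attack instances were evaluated), and nothing here is taken from the paper's evaluation code.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half_width, center + half_width


# Hypothetical: 90 defended attacks out of 100 attempts.
low, high = wilson_interval(90, 100)
print(f"DSR = 0.90, 95% interval = [{low:.3f}, {high:.3f}]")
```

With small attack sets the interval is wide, which is why the number of attack instances matters for interpreting a "DSR above 90%" claim.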

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of REFLECTOR and the recommendation for minor revision. We appreciate the recognition that the two-stage SFT-then-RL pipeline with outcome-driven and validity rewards represents a meaningful advance in embedding trajectory-level reflection for indirect jailbreak defense while preserving utility.

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical two-stage training procedure (teacher-guided SFT to seed reflection patterns, followed by RL with outcome-driven and reward-validity signals) whose performance is evaluated on external benchmarks such as DSR against indirect jailbreaks and accuracy on GSM8K. No equations, self-definitional constructs, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claims rest on measured generalization across threat scenarios rather than any reduction of outputs to inputs by construction, rendering the framework self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the transferability of teacher-generated reflection patterns and the ability of RL to produce autonomous safety behavior without side effects.

axioms (2)

domain assumption: Teacher-guided generation produces high-quality reflection data suitable for SFT that transfers to autonomous use. Invoked in the first stage of the framework to establish structured reflection patterns.
domain assumption: Outcome-driven and reward-validity supervision in RL can instill robust self-reflection without degrading general capabilities. Central to the second stage and the claim of both safety and utility gains.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: we propose a dual reward function... r(τ) = r_safety(y) + r_reflect(z, y) ... +λ if reflection and HarmCLS(y) = 1; -λ if reflection and HarmCLS(y) = 0; 0 if no reflection

IndisputableMonolith/Foundation/ArrowOfTime.lean · z_monotone_absolute · relation: unclear

Paper passage: Stage I: Reflection Capability Injection via Supervised Fine-Tuning ... Stage II: Dual-Reward Enhancement via Reinforcement Learning
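
The dual reward quoted above under washburn_uniqueness_aczel can be sketched in code. This is one reading of the quoted fragment, not the authors' implementation: the three-way case is taken here to define r_reflect, the passage elides how r_safety is computed (so it is passed in as a plain number), and `LAMBDA`, `has_reflection`, and `harm_cls` are hypothetical names. The sign convention follows the quote literally.

```python
LAMBDA = 0.5  # hypothetical value; the quoted passage does not give λ


def reflect_reward(has_reflection: bool, harm_cls: int, lam: float = LAMBDA) -> float:
    """r_reflect(z, y) as quoted: +λ or -λ by HarmCLS(y) when a reflection is present."""
    if not has_reflection:
        return 0.0  # no reflection: this term contributes nothing
    if harm_cls == 1:
        return lam
    if harm_cls == 0:
        return -lam
    raise ValueError("harm_cls must be 0 or 1")


def total_reward(safety_reward: float, has_reflection: bool, harm_cls: int,
                 lam: float = LAMBDA) -> float:
    """r(τ) = r_safety(y) + r_reflect(z, y), with r_safety supplied by the caller."""
    return safety_reward + reflect_reward(has_reflection, harm_cls, lam)


# Hypothetical trajectories: (r_safety, reflection present?, HarmCLS output).
for safety, reflected, cls in [(1.0, True, 1), (1.0, True, 0), (-1.0, False, 0)]:
    print(safety, reflected, cls, total_reward(safety, reflected, cls))
```

Keeping the reflection term separate from the safety term means a reflection earns credit only in combination with the classifier verdict on the final response, which is one plausible way to discourage reflections that are emitted purely for reward.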

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

arXiv preprint arXiv:1901.10995 , year=

Go-explore: a new approach for hard-exploration problems , author=. arXiv preprint arXiv:1901.10995 , year=

work page arXiv 1901

[3] [3]

by richard’s sutton , author=

Reinforcement learning: An introduction. by richard’s sutton , author=. SIAM Rev , volume=. 2021 , publisher=

work page 2021

[4] [4]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Open problems and fundamental limitations of reinforcement learning from human feedback , author=. arXiv preprint arXiv:2307.15217 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

2025 , eprint=

OpenAI GPT-5 System Card , author=. 2025 , eprint=

work page 2025

[6] [6]

Advances in neural information processing systems , volume=

Generative adversarial imitation learning , author=. Advances in neural information processing systems , volume=

work page

[7] [7]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

work page 2025

[9] [9]

2023 , eprint=

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset , author=. 2023 , eprint=

work page 2023

[10] [10]

2025 , eprint=

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators , author=. 2025 , eprint=

work page 2025

[11] [11]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

work page

[12] [12]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

work page

[13] [13]

Safety alignment should be made more than just a few tokens deep

Safety alignment should be made more than just a few tokens deep , author=. arXiv preprint arXiv:2406.05946 , year=

work page arXiv

[14] [14]

arXiv preprint arXiv:2502.02384 , year=

Stair: Improving safety alignment with introspective reasoning , author=. arXiv preprint arXiv:2502.02384 , year=

work page arXiv

[15] [15]

Satori: Reinforcement learning with chain-of-action-thought enhances LLM reasoning via autoregressive search.arXiv preprint arXiv:2502.02508,2025

Satori: Reinforcement learning with chain-of-action-thought enhances llm reasoning via autoregressive search , author=. arXiv preprint arXiv:2502.02508 , year=

work page arXiv

[16] [16]

2024 , eprint=

A StrongREJECT for Empty Jailbreaks , author=. 2024 , eprint=

work page 2024

[17] [17]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

Xstest: A test suite for identifying exaggerated safety behaviours in large language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

work page 2024

[18] [18]

WildChat: 1M ChatGPT Interaction Logs in the Wild

Wildchat: 1m chatgpt interaction logs in the wild , author=. arXiv preprint arXiv:2405.01470 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Towards veri- fying the geometric robustness of large-scale neural net- works

Do-not-answer: A dataset for evaluating safeguards in llms , author=. arXiv preprint arXiv:2308.13387 , year=

work page arXiv

[20] [20]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Autodan: Generating stealthy jailbreak prompts on aligned large language models , author=. arXiv preprint arXiv:2310.04451 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

33rd USENIX Security Symposium (USENIX Security 24) , pages=

Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=

work page

[22] [22]

arXiv preprint arXiv:2311.08268 (2023)

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily , author=. arXiv preprint arXiv:2311.08268 , year=

work page arXiv

[23] [23]

2024 , eprint=

DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers , author=. 2024 , eprint=

work page 2024

[24] [24]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[25]

Jailbreaking black box large language models in twenty queries. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML).

[26]

Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043.

[27]

Assessing the brittleness of safety alignment via pruning and low-rank modifications. arXiv preprint arXiv:2402.05162.

[28]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249.

[29] [30]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems.

[30] [31]

Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368.

[31] [32]

Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. arXiv preprint arXiv:2111.02840.

[32] [33]

Skywork Open Reasoner 1 Technical Report. arXiv preprint arXiv:2505.22312.

[33] [34]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[34] [35]

Qwen Team. QwQ-32B: Embracing the Power of Reinforcement Learning.

[35] [36]

gpt-oss-120b & gpt-oss-20b Model Card. arXiv preprint arXiv:2508.10925.

[36] [37]

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization. arXiv preprint arXiv:2601.05242.

[37] [38]

Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.

[38] [39]

A General Language Assistant as a Laboratory for Alignment. arXiv preprint arXiv:2112.00861.

[39] [40]

Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.

[40] [41]

Attacks, defenses and evaluations for LLM conversation safety: A survey. arXiv preprint arXiv:2402.09283.

[41] [42]

Self-Instruct: Aligning language models with self-generated instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[42] [43]

GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.

[43] [44]

Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[44] [45]

Measuring Mathematical Problem Solving With the MATH Dataset. arXiv preprint arXiv:2103.03874.

[45] [46]

Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

[46] [47]

Using an LLM to help with code understanding. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering.

[47] [48]

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. NeurIPS.

[48] [49]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv preprint arXiv:2305.13860.

[49] [50]

How robust is Google's Bard to adversarial image attacks? arXiv preprint arXiv:2309.11751, 2023.

[50] [51]

Jailbreaking prompt attack: A controllable adversarial attack against diffusion models. Findings of the Association for Computational Linguistics: NAACL 2025.

[51] [52]

Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology: a recent scoping review. Diagnostic Pathology, 2024.

[52] [53]

Simulating classroom education with LLM-empowered agents. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[53] [54]

Safe RLHF: Safe Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2310.12773.

[54] [55]

Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems.

[55] [56]

DeepInception: Hypnotize Large Language Model to Be Jailbreaker. arXiv preprint arXiv:2311.03191.

[56] [57]

On the Fundamental Limits of LLMs at Scale. arXiv preprint arXiv:2511.12869.

[57] [58]

Lifelong Safety Alignment for Language Models. arXiv preprint arXiv:2505.20259.