pith. machine review for the scientific record.

arxiv: 2602.08813 · v2 · submitted 2026-02-09 · 💻 cs.LG

Recognition: 2 theorem links


Robust Policy Optimization to Prevent Catastrophic Forgetting

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords: robust RLHF · catastrophic forgetting · policy optimization · safety alignment · KL divergence · max-min optimization · fine-tuning robustness
0 comments

The pith

A max-min objective over KL neighborhoods makes RLHF policies stable against downstream fine-tuning loss of safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RLHF optimizes reward only at the current policy, leaving it vulnerable to sharp drops in safety when later fine-tuned on new tasks. FRPO instead solves a max-min problem that guarantees high reward across an entire ball of policies reachable by standard adaptation steps. The method modifies GRPO to implement this without extra computation. Experiments on multiple base models show large reductions in safety degradation under both SFT and RL fine-tuning while task performance stays intact. The same pattern holds in a math reasoning RL setting where accuracy is preserved after further adaptation.

Core claim

By replacing the standard RLHF objective with a max-min formulation that optimizes the worst-case reward inside a KL-bounded neighborhood of policies, the resulting base policy maintains high reward even after subsequent standard fine-tuning steps, thereby reducing catastrophic forgetting of earlier behaviors such as safety alignment.

What carries the argument

The max-min objective in FRPO, which maximizes the minimum reward over all policies inside a KL divergence ball around the current policy.
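
A schematic rendering of that objective, in our own notation (the paper's exact constraint set, reward definition, and radius may differ):

```latex
% Max-min RLHF objective over a KL neighborhood (sketch; notation ours).
% \pi_\theta : base policy being trained,  r(x,y) : reward model,
% \varepsilon : assumed radius of policies reachable by downstream fine-tuning.
\max_{\theta}\;
\min_{\pi' \,:\, \mathrm{KL}\left(\pi' \,\|\, \pi_\theta\right) \le \varepsilon}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi'(\cdot \mid x)}\big[\, r(x, y) \,\big]
```

Standard RLHF maximizes the same expectation only at π' = π_θ, so a policy can sit on a sharp peak whose reward collapses under small KL shifts; the inner minimization forces the reward to stay high across the whole neighborhood.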

If this is right

  • Safety degradation is substantially reduced across multiple base models under both supervised fine-tuning and reinforcement learning downstream regimes.
  • Downstream task performance remains comparable to policies trained with standard objectives.
  • The robustness benefit extends to math-focused RL settings where accuracy is preserved under subsequent fine-tuning.
  • The algorithm requires no extra computation beyond the base GRPO procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-training robustness at the RLHF stage may reduce the need for separate forgetting-prevention techniques applied after every downstream update.
  • The KL-ball approach could be tested on continual learning problems outside language modeling where sequential adaptation also causes forgetting.
  • If real fine-tuning trajectories often exit the assumed KL neighborhood, the method's protection would be limited to only mild adaptation steps.

Load-bearing premise

The set of policies actually reached by standard downstream fine-tuning lies inside the KL-bounded neighborhood used in the max-min optimization.

What would settle it

Apply FRPO, then run a downstream fine-tuning step whose resulting policy lies inside the KL ball yet produces a large drop in safety reward.
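
A minimal sketch of how that coverage check could be run, assuming HuggingFace-style causal LMs for the FRPO-trained base policy and the downstream fine-tuned policy; the function names, the token-level KL approximation, and the radius value are ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def sequence_kl(base_model, ft_model, tokenizer, prompt, response):
    """Approximate sequence-level KL(pi_ft || pi_base) for one prompt/response pair,
    summing full-vocabulary token KLs along the given response trajectory."""
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    # Assumes prompt-only tokenization aligns with the joint tokenization at the boundary.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        base_logits = base_model(ids).logits[0, prompt_len - 1 : -1]
        ft_logits = ft_model(ids).logits[0, prompt_len - 1 : -1]
    log_p_ft = F.log_softmax(ft_logits, dim=-1)
    log_p_base = F.log_softmax(base_logits, dim=-1)
    kl = (log_p_ft.exp() * (log_p_ft - log_p_base)).sum(dim=-1)
    return kl.sum().item()

EPSILON = 10.0  # hypothetical KL radius; the paper's actual radius may differ

def coverage_report(base_model, ft_model, tokenizer, pairs):
    """pairs: list of (safety_prompt, sampled_response) from the fine-tuned policy."""
    kls = [sequence_kl(base_model, ft_model, tokenizer, p, r) for p, r in pairs]
    inside = sum(k <= EPSILON for k in kls)
    print(f"mean seq-level KL = {sum(kls) / len(kls):.2f}, "
          f"{inside}/{len(kls)} sequences inside the assumed radius")
```

Pairing this with a before/after safety-reward evaluation on the same prompts would show whether a large safety drop can occur while the fine-tuned policy stays inside the ball.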

Figures

Figures reproduced from arXiv: 2602.08813 by Adel Javanmard, George Pappas, Hamed Hassani, Mahdi Sabbaghi.

Figure 1: Illustration of FRPO. Standard RLHF finds high-reward policies that may lie in sharp regions, …
Figure 2: (left/middle) The safety reward for Mistral and Qwen as the policy moves from the base by increasing the KL during fine-tuning, when sweeping λ, evaluated on a split of the safety prompts; λ = 0.2 better preserves the safety reward for both models and yields the flattest landscape. (right) KL is the average sequence-level KL on safety prompts, which increases under a constant-lr schedule. Skipping the KL …
Figure 3: Safety evaluation after Alpaca SFT on HarmBench (…
Figure 4: Safety metrics during GSM8k SFT for Mistral and Qwen models. Our method maintains higher …
Figure 5: (left) Fine-tuning the models on UltraFeedback with GRPO leads to a significant increase in the average response length, inducing more detailed answers to harmful demands. (right) Helpfulness vs. safety score (1 − StrongREJECT score) for Mistral models after GRPO on UltraFeedback. λ = 0.5 has a better safety score but lower helpfulness; λ = 2.0 and GRPO have higher helpfulness scores. Helpfulness-saf…
Figure 6: Safety training curves for Mistral and Qwen with GRPO and FRPO for …
Figure 7: We have two rewards for math training: (left) the format reward (whether the response contains any final answer) increases to ∼0.9 for all the trained models; (right) the correctness reward increases similarly for all models, roughly reaching 0.7.
Figure 8: During training, among KL values that preserve the helpfulness reward and avoid reducing the policy …
Figure 9: The policy collapses in the absence of the derived baseline in Equation (…
Figure 10: (left) The bias in the gradient estimator does not show itself in the safety training curves, and training without the jackknife trick looks similar. (right) However, at downstream time, the model trained without the jackknife trick is less robust and its StrongREJECT score (↓ is better) grows more than that of the model trained with the jackknife trick.
Figure 11: The model trained with FRPO and λ = 0.5, fine-tuned on Alpaca; (left/middle) a higher learning rate for SFT on Alpaca (described in Section 5.1.1) slightly degrades safety while strongly impacting the helpfulness score. (right) A higher learning rate still changes the KL controllably but with a higher slope, confirming the constraint in Section 3.
Figure 12: (left) The SFT loss during fine-tuning shows that the FRPO-trained model fits the downstream task as well as the other models. (right) FRPO also keeps general capabilities higher than GRPO after fine-tuning.
Figure 13: The refusal rate and the StrongREJECT score of the models during fine-tuning on GSM8K.
read the original abstract

Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests standard RLHF objectives do not guarantee robustness to future adaptation. To address it, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing this requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we develop an algorithm with no extra computation, and empirically show it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Fine-tuning Robust Policy Optimization (FRPO), a modification of GRPO that replaces the standard objective with a max-min formulation over reward within a fixed KL ball around the base policy. The goal is to produce base policies whose reward remains stable under subsequent downstream SFT or RL fine-tuning, thereby reducing catastrophic forgetting of safety behaviors. The authors claim the modification incurs no extra computation and present empirical results showing reduced safety degradation across multiple base models and fine-tuning regimes while preserving downstream task performance; they also report results in a math-focused RL setting.

Significance. If the central empirical claim holds under the stated assumptions, the work offers a practical pre-fine-tuning intervention that could reduce reliance on post-hoc forgetting mitigation in multi-stage LLM pipelines. The no-extra-computation property and the extension to both SFT and RL downstream regimes are notable strengths for adoption.

major comments (2)
  1. [§3 and §4] FRPO objective and experiments: The central claim that optimizing worst-case reward inside the KL ball protects against real downstream adaptation rests on the unverified assumption that policies reached by standard SFT/RL fine-tuning remain inside that ball. No measurements of post-fine-tuning KL divergence (relative to the radius chosen in the max-min objective) are reported for any of the evaluated regimes, leaving coverage untested.
  2. [§3.2] Algorithmic modification: The claim of 'no extra computation' relative to GRPO requires explicit verification that the inner minimization over the KL ball is achieved without additional sampling, gradient steps, or hyperparameter tuning beyond standard GRPO; the current description does not detail the approximation used or its computational equivalence.
minor comments (2)
  1. [Tables/Figures] Table 1 and Figure 2: axis labels and legend entries should explicitly state the KL radius used for each FRPO run so readers can assess sensitivity.
  2. [Related Work] Related-work section: the discussion of prior robust RLHF methods should cite the specific KL-ball radii or neighborhood sizes employed in those works for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will incorporate the requested clarifications and measurements in the revised manuscript.

read point-by-point responses
  1. Referee: [§3 and §4] The central claim that optimizing worst-case reward inside the KL ball protects against real downstream adaptation rests on the unverified assumption that policies reached by standard SFT/RL fine-tuning remain inside that ball. No measurements of post-fine-tuning KL divergence (relative to the radius chosen in the max-min objective) are reported for any of the evaluated regimes, leaving coverage untested.

    Authors: We agree that reporting the actual post-fine-tuning KL divergences would directly test whether the downstream policies fall inside the robustness ball. In the revision we will add these measurements for all SFT and RL regimes and base models, computed with the same tokenizer and reference policy used in FRPO. The KL radius was selected from values commonly observed in the fine-tuning literature; the new tables will allow readers to verify coverage empirically. revision: yes

  2. Referee: [§3.2] The claim of 'no extra computation' relative to GRPO requires explicit verification that the inner minimization over the KL ball is achieved without additional sampling, gradient steps, or hyperparameter tuning beyond standard GRPO; the current description does not detail the approximation used or its computational equivalence.

    Authors: We will expand §3.2 with a full derivation and pseudocode. The inner minimization is solved in closed form by reweighting the existing advantage estimates with a single scalar Lagrange multiplier for the KL constraint; this multiplier is obtained via a lightweight line search on the same batch of samples already drawn for the GRPO outer update. Consequently, no new policy samples, forward passes, or gradient steps are required, and the only added hyperparameter is the fixed KL radius (set once, not tuned per run). The revised text will include wall-clock timings confirming identical per-iteration cost. revision: yes
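
Because the rebuttal above is itself simulated, the mechanism it describes is best read as a hypothesis. The sketch below illustrates one way such a closed-form, sample-reusing inner step could look: the worst-case policy inside the ball is taken as an exponential tilt of the current policy, the scalar multiplier is found by bisection so the tilt sits on the KL boundary, and the existing GRPO advantages are reweighted. The tilt form, the bisection, and all names are our assumptions, not the paper's algorithm.

```python
import numpy as np

def pessimistic_weights(rewards, lam):
    """Self-normalized weights of a worst-case policy pi'(y) ∝ pi_theta(y)·exp(-r(y)/lam),
    estimated from a batch of responses sampled from pi_theta."""
    logits = -np.asarray(rewards, dtype=np.float64) / lam
    logits -= logits.max()                       # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def empirical_kl(weights):
    """KL(pi' || pi_theta) estimated from self-normalized weights on n samples
    drawn from pi_theta: sum_i w_i * log(n * w_i)."""
    n = len(weights)
    return float(np.sum(weights * np.log(np.clip(n * weights, 1e-12, None))))

def solve_multiplier(rewards, epsilon, lo=1e-3, hi=1e3, iters=60):
    """Log-space bisection for the scalar multiplier that puts the tilted
    policy on the boundary of the KL ball (KL ≈ epsilon)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if empirical_kl(pessimistic_weights(rewards, mid)) > epsilon:
            lo = mid                             # tilt too sharp -> raise the multiplier
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Usage sketch on one GRPO group: reweight advantages, reusing the same samples.
rewards = np.random.randn(16)                    # toy rewards for illustration only
lam = solve_multiplier(rewards, epsilon=0.1)
w = pessimistic_weights(rewards, lam)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
robust_advantages = len(rewards) * w * advantages  # worst-case-weighted advantages
```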

Circularity Check

0 steps flagged

No circularity: max-min objective defined directly from reward and KL ball

full rationale

The derivation introduces FRPO as a max-min formulation that optimizes worst-case reward inside a fixed KL neighborhood around the base policy. This is a standard robust-optimization construction stated directly in terms of the reward function and the KL divergence constraint; it does not reduce to any fitted parameter, self-referential prediction, or load-bearing self-citation. No uniqueness theorem, ansatz smuggled via prior work, or renaming of an empirical pattern is invoked. The central claim therefore remains independent of its own outputs and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the modeling assumption that downstream fine-tuning corresponds to small KL shifts; no explicit free parameters or invented entities are introduced beyond standard RLHF components.

axioms (1)
  • domain assumption: Downstream adaptations remain inside a KL-bounded neighborhood of the base policy
    This neighborhood defines the set over which the max-min is taken and is required for the robustness guarantee to transfer to actual fine-tuning.

pith-pipeline@v0.9.0 · 5511 in / 1257 out tokens · 31506 ms · 2026-05-16T05:30:14.613398+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  2. Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

    cs.CL 2026-04 accept novelty 5.0

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 2 Pith papers · 25 internal anchors
