pith. machine review for the scientific record.

arxiv: 2602.08813 · v2 · submitted 2026-02-09 · 💻 cs.LG

Recognition: 2 theorem links


Robust Policy Optimization to Prevent Catastrophic Forgetting

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords: robust RLHF · catastrophic forgetting · policy optimization · safety alignment · KL divergence · max-min optimization · fine-tuning robustness
0 comments

The pith

A max-min objective over KL neighborhoods makes RLHF policies stable against downstream fine-tuning loss of safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RLHF optimizes reward only at the current policy, leaving it vulnerable to sharp drops in safety when later fine-tuned on new tasks. FRPO instead solves a max-min problem that guarantees high reward across an entire ball of policies reachable by standard adaptation steps. The method modifies GRPO to implement this without extra computation. Experiments on multiple base models show large reductions in safety degradation under both SFT and RL fine-tuning while task performance stays intact. The same pattern holds in a math reasoning RL setting where accuracy is preserved after further adaptation.

Core claim

By replacing the standard RLHF objective with a max-min formulation that optimizes the worst-case reward inside a KL-bounded neighborhood of policies, the resulting base policy maintains high reward even after subsequent standard fine-tuning steps, thereby reducing catastrophic forgetting of earlier behaviors such as safety alignment.

What carries the argument

The max-min objective in FRPO, which maximizes the minimum reward over all policies inside a KL divergence ball around the current policy.
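
A schematic rendering of that objective, in our own notation (the paper's exact constraint set, reward definition, and radius may differ):

```latex
% Max-min RLHF objective over a KL neighborhood (sketch; notation ours).
% \pi_\theta : base policy being trained,  r(x,y) : reward model,
% \varepsilon : assumed radius of policies reachable by downstream fine-tuning.
\max_{\theta}\;
\min_{\pi' \,:\, \mathrm{KL}\left(\pi' \,\|\, \pi_\theta\right) \le \varepsilon}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi'(\cdot \mid x)}\big[\, r(x, y) \,\big]
```

Standard RLHF maximizes the same expectation only at π' = π_θ, so a policy can sit on a sharp peak whose reward collapses under small KL shifts; the inner minimization forces the reward to stay high across the whole neighborhood.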

If this is right

  • Safety degradation is substantially reduced across multiple base models under both supervised fine-tuning and reinforcement learning downstream regimes.
  • Downstream task performance remains comparable to policies trained with standard objectives.
  • The robustness benefit extends to math-focused RL settings where accuracy is preserved under subsequent fine-tuning.
  • The algorithm requires no extra computation beyond the base GRPO procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-training robustness at the RLHF stage may reduce the need for separate forgetting-prevention techniques applied after every downstream update.
  • The KL-ball approach could be tested on continual learning problems outside language modeling where sequential adaptation also causes forgetting.
  • If real fine-tuning trajectories often exit the assumed KL neighborhood, the method's protection would be limited to only mild adaptation steps.

Load-bearing premise

The set of policies actually reached by standard downstream fine-tuning lies inside the KL-bounded neighborhood used in the max-min optimization.

What would settle it

Apply FRPO, then run a downstream fine-tuning step whose resulting policy lies inside the KL ball yet produces a large drop in safety reward.
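
A minimal sketch of how that coverage check could be run, assuming HuggingFace-style causal LMs for the FRPO-trained base policy and the downstream fine-tuned policy; the function names, the token-level KL approximation, and the radius value are ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def sequence_kl(base_model, ft_model, tokenizer, prompt, response):
    """Approximate sequence-level KL(pi_ft || pi_base) for one prompt/response pair,
    summing full-vocabulary token KLs along the given response trajectory."""
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    # Assumes prompt-only tokenization aligns with the joint tokenization at the boundary.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        base_logits = base_model(ids).logits[0, prompt_len - 1 : -1]
        ft_logits = ft_model(ids).logits[0, prompt_len - 1 : -1]
    log_p_ft = F.log_softmax(ft_logits, dim=-1)
    log_p_base = F.log_softmax(base_logits, dim=-1)
    kl = (log_p_ft.exp() * (log_p_ft - log_p_base)).sum(dim=-1)
    return kl.sum().item()

EPSILON = 10.0  # hypothetical KL radius; the paper's actual radius may differ

def coverage_report(base_model, ft_model, tokenizer, pairs):
    """pairs: list of (safety_prompt, sampled_response) from the fine-tuned policy."""
    kls = [sequence_kl(base_model, ft_model, tokenizer, p, r) for p, r in pairs]
    inside = sum(k <= EPSILON for k in kls)
    print(f"mean seq-level KL = {sum(kls) / len(kls):.2f}, "
          f"{inside}/{len(kls)} sequences inside the assumed radius")
```

Pairing this with a before/after safety-reward evaluation on the same prompts would show whether a large safety drop can occur while the fine-tuned policy stays inside the ball.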

Figures

Figures reproduced from arXiv: 2602.08813 by Adel Javanmard, George Pappas, Hamed Hassani, Mahdi Sabbaghi.

Figure 1: Illustration of FRPO. Standard RLHF finds high-reward policies that may lie in sharp regions, …
Figure 2: (left/middle) The safety reward for Mistral and Qwen as the policy moves from the base by increasing the KL during fine-tuning, when sweeping λ, evaluated on a split of the safety prompts; λ = 0.2 better preserves the safety reward for both models and yields the flattest landscape. (right) KL is the average sequence-level KL on safety prompts, which increases under a constant-lr schedule. Skipping the KL …
Figure 3: Safety evaluation after Alpaca SFT on HarmBench (…
Figure 4: Safety metrics during GSM8k SFT for Mistral and Qwen models. Our method maintains higher …
Figure 5: (left) Fine-tuning the models on UltraFeedback with GRPO leads to a significant increase in the average response length, inducing more detailed answers to harmful demands. (right) Helpfulness vs. safety score (1 − StrongREJECT score) for Mistral models after GRPO on UltraFeedback. λ = 0.5 has a better safety score but lower helpfulness; λ = 2.0 and GRPO have higher helpfulness scores. Helpfulness-saf…
Figure 6: Safety training curves for Mistral and Qwen with GRPO and FRPO for …
Figure 7: We have two rewards for math training: (left) the format reward (whether the response contains any final answer) increases to ∼0.9 for all the trained models; (right) the correctness reward increases similarly for all models, roughly reaching 0.7.
Figure 8: During training, among KL values that preserve the helpfulness reward and avoid reducing the policy …
Figure 9: The policy collapses in the absence of the derived baseline in Equation (…
Figure 10: (left) The bias in the gradient estimator does not show itself in the safety training curves, and training without the jackknife trick looks similar. (right) However, at downstream time, the model trained without the jackknife trick is less robust and its StrongREJECT score (↓ is better) grows more than that of the model trained with the jackknife trick.
Figure 11: The model trained with FRPO and λ = 0.5, fine-tuned on Alpaca; (left/middle) a higher learning rate for SFT on Alpaca (described in Section 5.1.1) slightly degrades safety while strongly impacting the helpfulness score. (right) A higher learning rate still changes the KL controllably but with a higher slope, confirming the constraint in Section 3.
Figure 12: (left) The SFT loss during fine-tuning shows that the FRPO-trained model fits the downstream task as well as the other models. (right) FRPO also keeps general capabilities higher than GRPO after fine-tuning.
Figure 13: The refusal rate and the StrongREJECT score of the models during fine-tuning on GSM8K.
read the original abstract

Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests standard RLHF objectives do not guarantee robustness to future adaptation. To address it, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing this requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we develop an algorithm with no extra computation, and empirically show it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Fine-tuning Robust Policy Optimization (FRPO), a modification of GRPO that replaces the standard objective with a max-min formulation over reward within a fixed KL ball around the base policy. The goal is to produce base policies whose reward remains stable under subsequent downstream SFT or RL fine-tuning, thereby reducing catastrophic forgetting of safety behaviors. The authors claim the modification incurs no extra computation and present empirical results showing reduced safety degradation across multiple base models and fine-tuning regimes while preserving downstream task performance; they also report results in a math-focused RL setting.

Significance. If the central empirical claim holds under the stated assumptions, the work offers a practical pre-fine-tuning intervention that could reduce reliance on post-hoc forgetting mitigation in multi-stage LLM pipelines. The no-extra-computation property and the extension to both SFT and RL downstream regimes are notable strengths for adoption.

major comments (2)
  1. [§3 and §4] FRPO objective and experiments: The central claim that optimizing worst-case reward inside the KL ball protects against real downstream adaptation rests on the unverified assumption that policies reached by standard SFT/RL fine-tuning remain inside that ball. No measurements of post-fine-tuning KL divergence (relative to the radius chosen in the max-min objective) are reported for any of the evaluated regimes, leaving coverage untested.
  2. [§3.2] Algorithmic modification: The claim of 'no extra computation' relative to GRPO requires explicit verification that the inner minimization over the KL ball is achieved without additional sampling, gradient steps, or hyperparameter tuning beyond standard GRPO; the current description does not detail the approximation used or its computational equivalence.
minor comments (2)
  1. [Tables/Figures] Table 1 and Figure 2: axis labels and legend entries should explicitly state the KL radius used for each FRPO run so readers can assess sensitivity.
  2. [Related Work] Related-work section: the discussion of prior robust RLHF methods should cite the specific KL-ball radii or neighborhood sizes employed in those works for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will incorporate the requested clarifications and measurements in the revised manuscript.

read point-by-point responses
  1. Referee: [§3 and §4] The central claim that optimizing worst-case reward inside the KL ball protects against real downstream adaptation rests on the unverified assumption that policies reached by standard SFT/RL fine-tuning remain inside that ball. No measurements of post-fine-tuning KL divergence (relative to the radius chosen in the max-min objective) are reported for any of the evaluated regimes, leaving coverage untested.

    Authors: We agree that reporting the actual post-fine-tuning KL divergences would directly test whether the downstream policies fall inside the robustness ball. In the revision we will add these measurements for all SFT and RL regimes and base models, computed with the same tokenizer and reference policy used in FRPO. The KL radius was selected from values commonly observed in the fine-tuning literature; the new tables will allow readers to verify coverage empirically. revision: yes

  2. Referee: [§3.2] The claim of 'no extra computation' relative to GRPO requires explicit verification that the inner minimization over the KL ball is achieved without additional sampling, gradient steps, or hyperparameter tuning beyond standard GRPO; the current description does not detail the approximation used or its computational equivalence.

    Authors: We will expand §3.2 with a full derivation and pseudocode. The inner minimization is solved in closed form by reweighting the existing advantage estimates with a single scalar Lagrange multiplier for the KL constraint; this multiplier is obtained via a lightweight line search on the same batch of samples already drawn for the GRPO outer update. Consequently, no new policy samples, forward passes, or gradient steps are required, and the only added hyperparameter is the fixed KL radius (set once, not tuned per run). The revised text will include wall-clock timings confirming identical per-iteration cost. revision: yes
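
Because the rebuttal above is itself simulated, the mechanism it describes is best read as a hypothesis. The sketch below illustrates one way such a closed-form, sample-reusing inner step could look: the worst-case policy inside the ball is taken as an exponential tilt of the current policy, the scalar multiplier is found by bisection so the tilt sits on the KL boundary, and the existing GRPO advantages are reweighted. The tilt form, the bisection, and all names are our assumptions, not the paper's algorithm.

```python
import numpy as np

def pessimistic_weights(rewards, lam):
    """Self-normalized weights of a worst-case policy pi'(y) ∝ pi_theta(y)·exp(-r(y)/lam),
    estimated from a batch of responses sampled from pi_theta."""
    logits = -np.asarray(rewards, dtype=np.float64) / lam
    logits -= logits.max()                       # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def empirical_kl(weights):
    """KL(pi' || pi_theta) estimated from self-normalized weights on n samples
    drawn from pi_theta: sum_i w_i * log(n * w_i)."""
    n = len(weights)
    return float(np.sum(weights * np.log(np.clip(n * weights, 1e-12, None))))

def solve_multiplier(rewards, epsilon, lo=1e-3, hi=1e3, iters=60):
    """Log-space bisection for the scalar multiplier that puts the tilted
    policy on the boundary of the KL ball (KL ≈ epsilon)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if empirical_kl(pessimistic_weights(rewards, mid)) > epsilon:
            lo = mid                             # tilt too sharp -> raise the multiplier
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Usage sketch on one GRPO group: reweight advantages, reusing the same samples.
rewards = np.random.randn(16)                    # toy rewards for illustration only
lam = solve_multiplier(rewards, epsilon=0.1)
w = pessimistic_weights(rewards, lam)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
robust_advantages = len(rewards) * w * advantages  # worst-case-weighted advantages
```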

Circularity Check

0 steps flagged

No circularity: max-min objective defined directly from reward and KL ball

full rationale

The derivation introduces FRPO as a max-min formulation that optimizes worst-case reward inside a fixed KL neighborhood around the base policy. This is a standard robust-optimization construction stated directly in terms of the reward function and the KL divergence constraint; it does not reduce to any fitted parameter, self-referential prediction, or load-bearing self-citation. No uniqueness theorem, ansatz smuggled via prior work, or renaming of an empirical pattern is invoked. The central claim therefore remains independent of its own outputs and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the modeling assumption that downstream fine-tuning corresponds to small KL shifts; no explicit free parameters or invented entities are introduced beyond standard RLHF components.

axioms (1)
  • domain assumption: Downstream adaptations remain inside a KL-bounded neighborhood of the base policy
    This neighborhood defines the set over which the max-min is taken and is required for the robustness guarantee to transfer to actual fine-tuning.

pith-pipeline@v0.9.0 · 5511 in / 1257 out tokens · 31506 ms · 2026-05-16T05:30:14.613398+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  2. Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

    cs.CL 2026-04 accept novelty 5.0

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 2 Pith papers · 25 internal anchors
