Recognition: 2 theorem links
Robust Policy Optimization to Prevent Catastrophic Forgetting
Pith reviewed 2026-05-16 05:30 UTC · model grok-4.3
The pith
A max-min objective over KL neighborhoods makes RLHF policies robust to loss of safety under downstream fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing the standard RLHF objective with a max-min formulation that optimizes the worst-case reward inside a KL-bounded neighborhood of policies, the resulting base policy maintains high reward even after subsequent standard fine-tuning steps, thereby reducing catastrophic forgetting of earlier behaviors such as safety alignment.
What carries the argument
The max-min objective in FRPO, which maximizes the minimum reward over all policies inside a KL divergence ball around the current policy.
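Per the review's rendering of the paper's Eq. 3.5, the dual of this max-min objective is an entropic risk, −λ log E[exp(−r/λ)]. A minimal numerical sketch of that surrogate, assuming only a batch of scalar rewards (the function name and values are illustrative, not the paper's code):

```python
import numpy as np

def entropic_risk(rewards, lam):
    """Dual form of worst-case reward over a KL ball:
    -lam * log E[exp(-r / lam)]. By Jensen's inequality this
    lower-bounds the mean reward, and it approaches the mean
    as lam -> infinity."""
    r = np.asarray(rewards, dtype=float)
    z = -r / lam
    m = z.max()  # log-sum-exp shift for numerical stability
    return -lam * (m + np.log(np.mean(np.exp(z - m))))

r = np.array([1.0, 0.2, 0.9, 0.4])
# Always sandwiched between the worst and the mean reward.
assert r.min() <= entropic_risk(r, 0.5) <= r.mean()
# Smaller lam is more risk-averse (closer to the worst case).
assert entropic_risk(r, 0.5) < entropic_risk(r, 5.0)
```

Small λ pushes the value toward the worst sampled reward (strong robustness), while λ → ∞ recovers the plain mean reward, i.e. the standard objective.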
If this is right
- Safety degradation is substantially reduced across multiple base models under both supervised fine-tuning and reinforcement learning downstream regimes.
- Downstream task performance remains comparable to policies trained with standard objectives.
- The robustness benefit extends to math-focused RL settings where accuracy is preserved under subsequent fine-tuning.
- The algorithm requires no extra computation beyond the base GRPO procedure.
Where Pith is reading between the lines
- Building robustness in at the RLHF stage, before any fine-tuning, may reduce the need for separate forgetting-prevention techniques applied after every downstream update.
- The KL-ball approach could be tested on continual learning problems outside language modeling where sequential adaptation also causes forgetting.
- If real fine-tuning trajectories often exit the assumed KL neighborhood, the method's protection would be limited to only mild adaptation steps.
Load-bearing premise
The set of policies actually reached by standard downstream fine-tuning lies inside the KL-bounded neighborhood used in the max-min optimization.
What would settle it
Apply FRPO, then run a downstream fine-tuning step whose resulting policy lies inside the KL ball yet produces a large drop in safety reward.
Figures
Original abstract
Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests standard RLHF objectives do not guarantee robustness to future adaptation. To address it, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing this requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we develop an algorithm with no extra computation, and empirically show it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Fine-tuning Robust Policy Optimization (FRPO), a modification of GRPO that replaces the standard objective with a max-min formulation over reward within a fixed KL ball around the base policy. The goal is to produce base policies whose reward remains stable under subsequent downstream SFT or RL fine-tuning, thereby reducing catastrophic forgetting of safety behaviors. The authors claim the modification incurs no extra computation and present empirical results showing reduced safety degradation across multiple base models and fine-tuning regimes while preserving downstream task performance; they also report results in a math-focused RL setting.
Significance. If the central empirical claim holds under the stated assumptions, the work offers a practical pre-fine-tuning intervention that could reduce reliance on post-hoc forgetting mitigation in multi-stage LLM pipelines. The no-extra-computation property and the extension to both SFT and RL downstream regimes are notable strengths for adoption.
major comments (2)
- [§3 and §4] §3 (FRPO objective) and §4 (experiments): The central claim that optimizing worst-case reward inside the KL ball protects against real downstream adaptation rests on the unverified assumption that policies reached by standard SFT/RL fine-tuning remain inside that ball. No measurements of post-fine-tuning KL divergence (relative to the radius chosen in the max-min objective) are reported for any of the evaluated regimes, leaving coverage untested.
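The coverage measurement the referee asks for can be sketched as a simple Monte Carlo estimator; the array layout and function name below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def mc_kl_estimate(logp_finetuned, logp_base, mask):
    """Monte Carlo estimate of per-token KL(pi_ft || pi_base), from the
    log-probabilities each policy assigns to sequences sampled from the
    fine-tuned policy: E_{y~pi_ft}[log pi_ft(y) - log pi_base(y)].
    Arrays have shape (batch, seq_len); mask is 1 on real tokens and
    0 on padding. Compare the result against the radius rho used in
    the max-min objective."""
    diff = (logp_finetuned - logp_base) * mask
    return diff.sum() / mask.sum()

# Sanity check: identical policies give zero estimated KL.
lp = np.log(np.full((2, 4), 0.25))
assert abs(mc_kl_estimate(lp, lp, np.ones((2, 4)))) < 1e-12
```

If the measured values routinely exceed ρ for standard SFT/RL runs, the load-bearing premise fails and the robustness guarantee would not cover those regimes.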
- [§3.2] §3.2 (algorithmic modification): The claim of 'no extra computation' relative to GRPO requires explicit verification that the inner maximization over the KL ball is achieved without additional sampling, gradient steps, or hyperparameter tuning beyond standard GRPO; the current description does not detail the approximation used or its computational equivalence.
minor comments (2)
- [Tables/Figures] Table 1 and Figure 2: axis labels and legend entries should explicitly state the KL radius used for each FRPO run so readers can assess sensitivity.
- [Related Work] Related-work section: the discussion of prior robust RLHF methods should cite the specific KL-ball radii or neighborhood sizes employed in those works for direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will incorporate the requested clarifications and measurements in the revised manuscript.
Point-by-point responses
-
Referee: [§3 and §4] The central claim that optimizing worst-case reward inside the KL ball protects against real downstream adaptation rests on the unverified assumption that policies reached by standard SFT/RL fine-tuning remain inside that ball. No measurements of post-fine-tuning KL divergence (relative to the radius chosen in the max-min objective) are reported for any of the evaluated regimes, leaving coverage untested.
Authors: We agree that reporting the actual post-fine-tuning KL divergences would directly test whether the downstream policies fall inside the robustness ball. In the revision we will add these measurements for all SFT and RL regimes and base models, computed with the same tokenizer and reference policy used in FRPO. The KL radius was selected from values commonly observed in the fine-tuning literature; the new tables will allow readers to verify coverage empirically. revision: yes
-
Referee: [§3.2] The claim of 'no extra computation' relative to GRPO requires explicit verification that the inner maximization over the KL ball is achieved without additional sampling, gradient steps, or hyperparameter tuning beyond standard GRPO; the current description does not detail the approximation used or its computational equivalence.
Authors: We will expand §3.2 with a full derivation and pseudocode. The inner maximization is solved in closed form by reweighting the existing advantage estimates with a single scalar Lagrange multiplier for the KL constraint; this multiplier is obtained via a lightweight line search on the same batch of samples already drawn for the GRPO outer update. Consequently, no new policy samples, forward passes, or gradient steps are required, and the only added hyperparameter is the fixed KL radius (set once, not tuned per run). The revised text will include wall-clock timings confirming identical per-iteration cost. revision: yes
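The mechanism the rebuttal describes (closed-form reweighting plus a scalar search for the multiplier) can be sketched under the assumption that the inner step reduces to exponential tilting of the batch; the function name, bounds, and iteration count are hypothetical:

```python
import numpy as np

def frpo_weights(rewards, rho, lo=1e-3, hi=1e3, iters=50):
    """Hypothetical sketch of the inner step: reweight the batch with
    w_i proportional to exp(-r_i / lam), choosing the scalar lam by
    bisection so that KL(w || uniform) is approximately rho (the ball
    radius). Smaller lam concentrates weight on low-reward samples."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)

    def weights(lam):
        z = -r / lam
        z -= z.max()                      # stabilize the softmax
        w = np.exp(z)
        return w / w.sum()

    def kl_to_uniform(w):
        return float(np.sum(w * np.log(np.clip(w * n, 1e-30, None))))

    for _ in range(iters):                # KL decreases as lam grows
        lam = np.sqrt(lo * hi)            # geometric bisection
        if kl_to_uniform(weights(lam)) > rho:
            lo = lam                      # too concentrated: raise lam
        else:
            hi = lam
    return weights(lam), lam
```

Because the weights reuse the batch already sampled for the outer GRPO update, no extra rollouts or forward passes are needed; the only new quantity is the scalar λ.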
Circularity Check
No circularity: max-min objective defined directly from reward and KL ball
Full rationale
The derivation introduces FRPO as a max-min formulation that optimizes worst-case reward inside a fixed KL neighborhood around the base policy. This is a standard robust-optimization construction stated directly in terms of the reward function and the KL divergence constraint; it does not reduce to any fitted parameter, self-referential prediction, or load-bearing self-citation. No uniqueness theorem, ansatz smuggled via prior work, or renaming of an empirical pattern is invoked. The central claim therefore remains independent of its own outputs and can be checked against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: downstream adaptations remain inside a KL-bounded neighborhood of the base policy.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniqueness) · tag: unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Linked passage: max_π inf_Q E_Q[r(x,y)] s.t. E[KL(Q∥π)] ≤ ρ (Eq. 3.1); the dual yields the entropic risk −λ log E[exp(−r/λ)] (Eq. 3.5).
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Linked passage: FRPO modifies GRPO with no extra compute; λ controls risk aversion; recovers GRPO as λ → ∞.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
-
[2]
Better Fine-tuning by Reducing Representational Collapse
Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. Better fine-tuning by reducing representational collapse.arXiv preprint arXiv:2008.03156,
-
[3]
Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. Opencodeinstruct: A large-scale instruction tuning dataset for code llms. arXiv preprint arXiv:2504.04030,
-
[4]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740,
-
[5]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691,
-
[6]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,
-
[7]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862,
-
[8]
LoRA Learns Less and Forgets Less
Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673,
-
[9]
Deep Reinforcement Learning from Human Preferences
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30,
-
[10]
Beyond variance reduction: Understanding the true impact of baselines on policy optimization
Wesley Chung, Valentin Thomas, Marlos C Machado, and Nicolas Le Roux. Beyond variance reduction: Understanding the true impact of baselines on policy optimization. In International Conference on Machine Learning, pages 1999–2009. PMLR,
-
[11]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
-
[12]
UltraFeedback: Boosting Language Models with Scaled AI Feedback
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. Ultrafeedback: Boosting language models with scaled ai feedback.arXiv preprint arXiv:2310.01377,
-
[13]
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. Or-bench: An over-refusal benchmark for large language models.arXiv preprint arXiv:2405.20947,
-
[14]
Esther Derman and Shie Mannor. Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894,
-
[15]
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, and Holger Boche. Safemerge: Preserving safety alignment in fine-tuned large language models via selective layer-wise model merging. arXiv preprint arXiv:2503.17239,
-
[16]
Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305,
-
[17]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378,
-
[18]
Maximum entropy rl (provably) solves some robust rl problems
Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257,
-
[19]
Yingjie Fei, Zhuoran Yang, Yudong Chen, and Zhaoran Wang. Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 34:20436–20446, 2021. Yingjie Fei, Zhuoran Yang, and Zhaoran Wang. Risk-sensitive reinforcement learning with function approximation: A debiasing a...
- [20]
-
[21]
An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks.arXiv preprint arXiv:1312.6211,
-
[22]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,
-
[23]
Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines
Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488,
-
[24]
Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal
Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal. arXiv preprint arXiv:2403.01244,
-
[25]
Editing Models with Task Arithmetic
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089,
-
[26]
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, et al. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. Advances in Neural Information Processing Systems, 37:47094–47165,
-
[27]
Fantastic Generalization Measures and Where to Find Them
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them.arXiv preprint arXiv:1912.02178,
-
[28]
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima.arXiv preprint arXiv:1609.04836,
-
[29]
Reasoning as an Adaptive Defense for Safety
Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, and Aviral Kumar. Reasoning as an adaptive defense for safety.arXiv preprint arXiv:2507.00971,
-
[30]
Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference.arXiv preprint arXiv:2309.10105,
-
[31]
Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. Mixout: Effective regularization to finetune large-scale pretrained language models.arXiv preprint arXiv:1909.11299,
-
[32]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783,
-
[33]
RewardBench 2: Advancing Reward Model Evaluation
Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A Smith, Hannaneh Hajishirzi, and Nathan Lambert. Rewardbench 2: Advancing reward model evaluation.arXiv preprint arXiv:2506.01937,
-
[34]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249,
-
[35]
Rupert G. Miller. The jackknife: a review. Biometrika, 61(1):1–15,
-
[36]
On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884,
-
[37]
Distributionally Robust Language Modeling
Yonatan Oren, Shiori Sagawa, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust language modeling. arXiv preprint arXiv:1909.02060,
-
[38]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693,
-
[39]
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep.arXiv preprint arXiv:2406.05946,
-
[40]
Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingr...
-
[41]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741,
-
[42]
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks.arXiv preprint arXiv:1606.04671,
-
[43]
Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization.arXiv preprint arXiv:1911.08731,
-
[44]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,
-
[45]
Fine-tuned Language Models Are Continual Learners
Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners.arXiv preprint arXiv:2205.12393,
-
[46]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[47]
Alexander Shapiro and Anton Kleywegt. Minimax analysis of stochastic problems. Optimization Methods and Software, 17(3):523–542,
-
[48]
Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571,
-
[49]
Distributionally Robust Reinforcement Learning
Elena Smirnova, Elvis Dohmatob, and Jérémie Mary. Distributionally robust reinforcement learning.arXiv preprint arXiv:1902.08708,
-
[50]
Lamol: Language modeling for lifelong language learning
Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. Lamol: Language modeling for lifelong language learning. arXiv preprint arXiv:1909.03329,
-
[51]
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, et al. Tamper-resistant safeguards for open-weight llms.arXiv preprint arXiv:2408.00761,
-
[52]
Yingshui Tan, Yilei Jiang, Yanshi Li, Jiaheng Liu, Xingyuan Bu, Wenbo Su, Xiangyu Yue, Xiaoyong Zhu, and Bo Zheng. Equilibrate rlhf: Towards balancing helpfulness-safety trade-off in large language models.arXiv preprint arXiv:2502.11555,
-
[53]
Robust Reinforcement Learning Using Adversarial Populations
Eugene Vinitsky, Yuqing Du, Kanaad Parvate, Kathy Jang, Pieter Abbeel, and Alexandre Bayen. Robust reinforcement learning using adversarial populations.arXiv preprint arXiv:2008.01825,
-
[54]
Orthogonal subspace learning for language model continual learning
Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671,
-
[55]
Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949,
-
[56]
Removing RLHF Protections in GPT-4 via Fine-tuning
Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, and Daniel Kang. Removing rlhf protections in gpt-4 via fine-tuning.arXiv preprint arXiv:2311.05553,
-
[57]
Surrogate Gap Minimization Improves Sharpness-Aware Training
Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training.arXiv preprint arXiv:2203.08065,