Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
Pith reviewed 2026-05-16 06:36 UTC · model grok-4.3
The pith
Projecting safety gradients onto the orthogonal complement of a low-rank general-capability subspace reduces the alignment tax during sequential LLM post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety gradients can be made less harmful to general capabilities by removing their projection onto a low-rank subspace spanned by gradients computed on a small general-capability reference set. The resulting update rule, called OGPSA, yields the locally optimal safety improvement subject to first-order preservation of the reference objectives and is compatible with existing SFT and DPO pipelines.
What carries the argument
Orthogonal Gradient Projection for Safety Alignment (OGPSA): a lightweight update that subtracts from each safety gradient its component in the low-rank subspace spanned by reference gradients from general-capability data.
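The projection described above can be sketched in a few lines of NumPy. This is a minimal illustration on random vectors, not the authors' implementation: the function names `orthonormal_basis` and `project_out`, the use of a thin SVD to obtain the low-rank basis, and all shapes are assumptions made for the example.

```python
import numpy as np

def orthonormal_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return a (d, rank) orthonormal basis for the top-`rank` directions
    spanned by the rows of `ref_grads` (shape: n_refs x d).
    Using an SVD here is an assumption, not a detail taken from the paper."""
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T

def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Subtract from `safety_grad` its component inside span(basis)."""
    return safety_grad - basis @ (basis.T @ safety_grad)

rng = np.random.default_rng(0)
ref_grads = rng.normal(size=(8, 50))   # 8 toy reference gradients, d = 50
safety_grad = rng.normal(size=50)      # toy safety gradient

basis = orthonormal_basis(ref_grads, rank=4)
projected = project_out(safety_grad, basis)

# The projected gradient has no component left in the reference subspace.
print(np.abs(basis.T @ projected).max())
```

Because the basis keeps only the top 4 of 8 directions in this toy setup, the projected gradient is orthogonal to the low-rank subspace but not necessarily to every individual reference gradient.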
If this is right
- Under sequential SFT followed by DPO, average performance gain rises from 33.98% to 42.74% on Qwen2.5-7B-Instruct.
- The same pipeline yields an increase from 19.74% to 32.98% on Llama3.1-8B-Instruct.
- The method works with both SFT and DPO without requiring large-scale replay of prior data.
- Periodic reference-gradient computation replaces full replay while still constraining the safety update.
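The periodic reference-gradient computation in the last bullet can be illustrated with a toy loop. Everything here is a hypothetical stand-in: the quadratic safety loss, the least-squares reference losses, the helper `run_toy_loop`, and the refresh interval `refresh_every` (which plays the role of the update frequency K mentioned in the paper excerpts further down this page) are assumptions chosen so the sketch runs without a model.

```python
import numpy as np

def top_k_basis(grads: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal (d, k) basis for the top-k right singular directions."""
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:k].T

def run_toy_loop(steps: int = 50, refresh_every: int = 5, k: int = 3,
                 lr: float = 0.1, seed: int = 1):
    rng = np.random.default_rng(seed)
    d = 20
    a_ref = rng.normal(size=(6, d))   # toy reference inputs
    b_ref = rng.normal(size=6)        # toy reference targets
    target = rng.normal(size=d)       # minimiser of the toy safety loss
    theta = np.zeros(d)

    def safety_loss(t: np.ndarray) -> float:
        return 0.5 * float(np.sum((t - target) ** 2))

    initial = safety_loss(theta)
    refreshes = 0
    for step in range(steps):
        if step % refresh_every == 0:
            # Recompute per-sample reference gradients only every few steps.
            residual = a_ref @ theta - b_ref
            ref_grads = residual[:, None] * a_ref
            basis = top_k_basis(ref_grads, k)
            refreshes += 1
        g = theta - target                   # gradient of the toy safety loss
        g_proj = g - basis @ (basis.T @ g)   # remove the reference component
        theta = theta - lr * g_proj
    return initial, safety_loss(theta), refreshes

initial_loss, final_loss, n_refresh = run_toy_loop()
print(initial_loss, final_loss, n_refresh)
```

The design point is that the subspace is reused between refreshes, so the extra cost is one batch of reference gradients plus a small SVD every `refresh_every` steps rather than on every update.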
Where Pith is reading between the lines
- The same projection idea could be applied to other sequential fine-tuning regimes where one objective risks erasing earlier skills.
- If the low-rank estimate remains stable across model scales, the technique may reduce the data volume needed for capability preservation.
- Adaptive choice of subspace rank or online re-estimation could further tighten the preservation constraint.
Load-bearing premise
A low-rank subspace estimated from a small general-capability dataset accurately identifies the directions whose first-order preservation is enough to prevent capability regression.
What would settle it
Run standard SFT or DPO versus OGPSA on the same model and data; if the final safety and utility scores are statistically indistinguishable or worse under OGPSA, the projection step provides no net benefit.
Original abstract
Safety post-training can improve the harmlessness and policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. We have open-sourced our code at https://github.com/SunGL001/OGPSA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that safety alignment can be viewed as a continual learning problem where safety gradients interfere with directions supporting general capabilities, leading to an alignment tax. It proposes OGPSA, which estimates a low-rank subspace from gradients on a small general-capability dataset and projects each safety gradient to be orthogonal to this subspace. This yields the steepest local safety descent subject to first-order preservation of reference objectives. The method is applied to SFT, DPO, and sequential SFT→DPO pipelines, reporting improved safety-utility trade-offs, including average performance gains rising from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct, with open-sourced code.
Significance. If the empirical results hold, OGPSA provides a lightweight, replay-free mechanism for mitigating one source of capability regression during sequential alignment stages. The first-order projection is transparent, the gains are demonstrated across two model families and three regimes, and the open-sourced implementation allows direct verification. This could be a practical addition to post-training pipelines when gradient interference between safety and utility objectives is a concern.
major comments (2)
- [§3] §3 (OGPSA definition): the central claim that removing the projected component avoids capability regression while preserving safety descent rests on the low-rank subspace from the small general dataset accurately capturing the relevant directions; the manuscript reports gains only for the chosen dataset and rank without ablations on dataset size, diversity, or rank sensitivity, so it is unclear whether the improvement generalizes when the subspace is less representative.
- [§4] §4 (experimental results): the reported average gains under SFT→DPO are load-bearing for the main claim, yet no analysis is given of how the projection affects the safety loss surface or convergence when safety and capability gradients overlap substantially; this leaves the weakest assumption untested.
minor comments (1)
- [Abstract and §3] The abstract and §3 mention periodic reference-gradient computation but do not quantify its frequency or overhead relative to standard SFT/DPO steps; adding a short complexity table would improve clarity.
Simulated Author's Rebuttal
We appreciate the referee's positive evaluation and recommendation for minor revision. We address the two major comments below, agreeing that additional analyses will improve the manuscript, and will incorporate them accordingly.
-
Referee: [§3] §3 (OGPSA definition): the central claim that removing the projected component avoids capability regression while preserving safety descent rests on the low-rank subspace from the small general dataset accurately capturing the relevant directions; the manuscript reports gains only for the chosen dataset and rank without ablations on dataset size, diversity, or rank sensitivity, so it is unclear whether the improvement generalizes when the subspace is less representative.
Authors: We thank the referee for highlighting this important point. The low-rank subspace is indeed central to the method's effectiveness. While our experiments use a standard general-capability dataset and a fixed rank chosen based on preliminary validation, we agree that explicit ablations would better support generalizability. In the revised version, we will add experiments varying dataset size (e.g., 50, 100, 500 samples), diversity (comparing to other benchmarks like MMLU subsets), and rank (k=2 to 32). These will be included in a new subsection in §3 or §4, with results showing that performance remains stable for reasonable choices of these hyperparameters. revision: yes
-
Referee: [§4] §4 (experimental results): the reported average gains under SFT→DPO are load-bearing for the main claim, yet no analysis is given of how the projection affects the safety loss surface or convergence when safety and capability gradients overlap substantially; this leaves the weakest assumption untested.
Authors: We agree that analyzing the effect on the loss surface and convergence under high overlap would strengthen the paper. In the revision, we will add a discussion and empirical analysis: specifically, we will compute the cosine similarity between safety and capability gradients before and after projection, and plot safety loss curves for cases with varying overlap. This will be presented in §4 to illustrate how OGPSA mitigates interference without stalling convergence. revision: yes
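The cosine-similarity diagnostic proposed in this response can be tried on toy vectors first. The snippet below is only an illustration under stated assumptions: the gradients are random, the overlap coefficient 0.7 is arbitrary, and the capability subspace is reduced to rank 1 so that the effect of projection is easy to read off.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
d = 64
capability_grad = rng.normal(size=d)
# Toy safety gradient that deliberately overlaps with the capability gradient.
safety_grad = 0.7 * capability_grad + rng.normal(size=d)

# Rank-1 reference subspace spanned by the capability gradient (assumption).
u = capability_grad / np.linalg.norm(capability_grad)
projected = safety_grad - u * (u @ safety_grad)

cos_before = cosine(safety_grad, capability_grad)
cos_after = cosine(projected, capability_grad)
# Fraction of the safety gradient norm that survives the projection.
retained = float(np.linalg.norm(projected) / np.linalg.norm(safety_grad))

print(cos_before, cos_after, retained)
```

The `retained` ratio is the quantity to watch in the high-overlap case the referee raises: as it approaches zero, the projected update carries little of the original safety signal and convergence could slow.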
Circularity Check
No significant circularity in OGPSA derivation chain
The central mechanism defines the update as the safety gradient with its component in the low-rank general-capability subspace removed. This is an explicit first-order projection constructed directly from separate gradient computations on two data sources; the final safety-utility metrics are not used to define or fit the subspace or projection operator. No step renames a fitted quantity as a prediction, imports uniqueness via self-citation, or smuggles an ansatz through prior work by the same authors. The reported gains (e.g., 33.98% to 42.74%) are presented as empirical outcomes of applying the rule, not as quantities that reduce to the rule's inputs by algebraic identity. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- subspace rank
axioms (2)
- domain assumption: The directions most relevant to general capabilities are spanned by gradients computed on a small held-out general-capability dataset.
- domain assumption: First-order orthogonality is sufficient to prevent capability regression during subsequent safety updates.
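The second assumption can be made concrete with a first-order expansion. The notation below is not taken from the source and is assumed for illustration: $U$ has orthonormal columns spanning the reference subspace, $g$ is the safety gradient, $\eta$ is the step size, and $L_{\text{ref}}$ is a reference objective whose gradient lies in the span of $U$, written as $\nabla L_{\text{ref}}(\theta) = U c$.

```latex
\begin{align*}
\tilde g &= (I - U U^\top)\, g, \\
L_{\text{ref}}(\theta - \eta \tilde g)
  &= L_{\text{ref}}(\theta) - \eta\, \nabla L_{\text{ref}}(\theta)^\top \tilde g + O(\eta^2) \\
  &= L_{\text{ref}}(\theta) - \eta\, c^\top U^\top (I - U U^\top)\, g + O(\eta^2) \\
  &= L_{\text{ref}}(\theta) + O(\eta^2),
  \qquad \text{since } U^\top (I - U U^\top) = U^\top - U^\top = 0 .
\end{align*}
```

The $O(\eta^2)$ remainder is exactly what this assumption leaves uncontrolled. A reference gradient with a component outside the rank-$k$ subspace would also leave a first-order term, which is where the first assumption enters.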
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...
-
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
-
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.
Reference graph
Works this paper leans on
-
[1]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,
-
[2]
A General Language Assistant as a Laboratory for Alignment
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861,
-
[3]
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
-
[4]
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,
-
[5]
Choi, H. K., Du, X., and Li, Y. Safety-aware fine-tuning of large language models. In NeurIPS Safe Generative AI Workshop 2024,
-
[6]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
-
[7]
How robust is Google’s Bard to adversarial image attacks? arXiv:2309.11751, 2023
Dong, Y., Chen, H., Chen, J., Fang, Z., Yang, X., Zhang, Y., Tian, Y., Su, H., and Zhu, J. How robust is Google’s Bard to adversarial image attacks? arXiv preprint arXiv:2309.11751,
-
[8]
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,
-
[9]
Farn, H., Su, H., Kumar, S. H., Sahay, S., Chen, S.-T., and Lee, H.-y. Safeguard fine-tuned LLMs through pre- and post-tuning model merging. arXiv preprint arXiv:2412.19512,
-
[10]
Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Heylar, A., Dias, R., Vallone, A., Ren, H., Wei, J., et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339,
-
[11]
Continual learning for text classification with information disentanglement based regularization
Huang, Y., Zhang, Y., Chen, J., Wang, X., and Yang, D. Continual learning for text classification with information disentanglement based regularization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2736–2746,
-
[12]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674,
-
[13]
Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720,
-
[14]
Pku-saferlhf: Towards multi-level safety alignment for llms with human preference
Ji, J., Hong, D., Zhang, B., Chen, B., Dai, J., Zheng, B., Qiu, T., Li, B., and Yang, Y. PKU-SafeRLHF: A safety alignment preference dataset for Llama family models. arXiv preprint arXiv:2406.15513,
-
[15]
Kantharaj, S., Do, X. L., Leong, R. T., Tan, J. Q., Hoque, E., and Joty, S. OpenCQA: Open-ended question answering with charts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11817–11837,
-
[16]
HaluEval: A large-scale hallucination evaluation benchmark for large language models
Li, J., Cheng, X., Zhao, X., Nie, J.-Y., and Wen, J.-R. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947,
-
[17]
Mitigating the alignment tax of RLHF
Lin, Y., Lin, H., Xiong, W., Diao, S., Liu, J., Zhang, J., Pan, R., Wang, H., Hu, W., Zhang, H., Dong, H., Pi, R., Zhao, H., Jiang, N., Ji, H., Yao, Y., and Zhang, T. Mitigating the alignment tax of RLHF. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 580–606,
-
[18]
Liu, G., Zhu, Y., Chen, J., and Jiang, M. Scientific algorithm discovery by augmenting AlphaEvolve with deep research. arXiv preprint arXiv:2510.06056,
-
[19]
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., and Liu, Y. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860,
-
[20]
XSTest: A test suite for identifying exaggerated safety behaviours in large language models
Röttger, P., Kirk, H., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5377–5400,
-
[21]
Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024a.
Wang, Y., Li, H., Han, X., Nakov, P., and Bal...
-
[22]
Reinforcement Learning for LLM Post-Training: A Survey
Wang, Z., Bi, B., Pentyala, S. K., Ramnath, K., Chaudhuri, S., Mehrotra, S., Mao, X.-B., Asur, S., et al. A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more. arXiv preprint arXiv:2407.16216, 2024c.
Wang, Z., Yang, F., Wang, L., Zhao, P., Wang, H., Chen, L., Lin, Q., and Wong, K.-F. SELF-GUARD: Empower the LLM to safeguard ...
-
[23]
Measuring short-form factuality in large language models
URL https://aclanthology.org/2024.naacl-long.92/.
Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J., and Fedus, W. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368,
-
[24]
Wu, L., Wang, M., Xu, Z., Cao, T., Oo, N., Hooi, B., and Deng, S. Automating steering for safe multimodal large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 792–814, 2025a.
Wu, Y., Piao, H., Huang, L.-K., Wang, R., Li, W., Pfister, H., Meng, D., Ma, K., and Wei, Y. Sd-lora: Scalable decou...
-
[25]
Zhang, G., Wang, L., Kang, G., Chen, L., and Wei, Y. SLCA: Slow learner with classifier alignment for continual learning on a pre-trained model. arXiv preprint arXiv:2303.05118,
-
[26]
Instruction-Following Evaluation for Large Language Models
Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023a.
Zhou, D.-W., Sun, H.-L., Ning, J., Ye, H.-J., and Zhan, D.-C. Continual learning with pre-trained models: a survey. In Proceedings of the Thirty-Third I...
-
[27]
and Qwen2.5-7B-Instruct (Yang et al., 2024b). For the safety alignment phase, we use a seed dataset $D_{\text{safe}}$ of 10k samples drawn from PKU-SafeRLHF (Ji et al., 2024), where the SFT labels and DPO chosen labels are refusal responses generated by gpt-4o-mini, and the DPO rejected labels are selected as the most unsafe answers in the original dataset...
-
[28]
as the general data source, where the SFT labels and DPO chosen labels are the highest-scoring answers in the original dataset and the DPO rejected labels are the lowest-scoring answers. Our proposed method (OGPSA) requires a small reference set to estimate the general-capability subspace. To efficiently capture the essential dimensi...
-
[29]
into the safety training data. Training Details. We train all LLMs with LLaMA-Factory (Zheng et al., 2024), a popular toolbox for LLM training. Consistent with established protocols (Zhang et al., 2025), all models are trained for 3 epochs in the SFT stage and 1 epoch in the DPO stage. We tune the learning rate 1e−6 and ...
-
[30]
We adopt a cosine scheduler with a warm-up ratio of 0.1. Following the official implementation, we set the learning rate to 1e−4 for LoRA. For the subspace update frequency K, we set K = 30 for all SFT experiments and K = 5 for DPO experiments. Evaluation. We employ a comprehensive suite of 10 benchmarks to evaluate the trade-off between safety and general utility. Safety (Harmle...
-
[31]
Additionally, we evaluate adversarial robustness using AdvGLUE (Wang et al., 2021)
and PAP (Zeng et al., 2024)), while reporting refusal rates for other datasets. General Utility: We assess diverse capabilities including truthfulness via SimpleQA (Wei et al., 2024), GPQA (Rein et al., 2024), and MMLU (Hendrycks et al., 2021a), general helpfulness via BIG-bench HHH (Zhou et al., 2024b), and instruction following via IFEval (Zhou et al.,...
-
[32]
and take the goodness score, which is 1 − rubric score, as the metric. We evaluate models on prompts with no jailbreak in addition to the reported top-2 jailbreak methods, PAIR (Chao et al., 2025) and PAP-Misrepresentation (Zeng et al., 2024). For main results, we only report the average goodness score on the two jailbreak methods, since most methods achi...
-
[33]
Substituting the decomposition: $\langle g, v\rangle = \langle g_S + g_\perp, v\rangle = \langle g_S, v\rangle + \underbrace{\langle g_\perp, v\rangle}_{0} = \langle g_S, v\rangle.$ (20) Step 4: Minimization via Cauchy-Schwarz. The problem reduces to minimizing the inner product $\langle g_S, v\rangle$ subject to unit norm. By the Cauchy-Schwarz (Björck, 1994; Leon et al.,
-
[34]
inequality: $|\langle g_S, v\rangle| \le \|g_S\|\,\|v\| = \|g_S\|.$ (21) This implies: $-\|g_S\| \le \langle g_S, v\rangle \le \|g_S\|.$ (22) The lower bound (maximum descent) is achieved if and only if $v$ is collinear with $g_S$ and points in the opposite ...
Footnote 1: https://platform.openai.com/docs/guides/moderation
-
[35]
on Qwen. We evaluate the performance stability as the number of samples used to estimate the reference capability gradients increases. Our method remains effective even with limited data budgets. The best results are marked in bold.
Model | Safety (↑): Stereotype, StrongReject | Truthful (↑): SimpleQA, MMLU | Helpful (↑): IFEval, HHH
Qwen2.5-7B-Instruct Model | 96.74 | 44.83 | ...
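The proof excerpts in entries [33] and [34] can be condensed into a single statement. This is a reconstruction under assumed notation rather than a quotation: $S$ is taken to be the feasible subspace of directions orthogonal to the reference subspace, $g = g_S + g_\perp$ with $g_S \in S$ and $g_\perp \perp S$, and $g_S \neq 0$. The closed form in the last line is inferred from the truncated final sentence of [34].

```latex
\begin{align*}
v^\star &= \arg\min_{v \in S,\ \|v\| = 1} \langle g, v \rangle
         = \arg\min_{v \in S,\ \|v\| = 1} \langle g_S, v \rangle, \\
\langle g_S, v \rangle &\ge -\|g_S\|\,\|v\| = -\|g_S\|
  \qquad \text{(Cauchy-Schwarz)}, \\
v^\star &= -\frac{g_S}{\|g_S\|},
  \qquad \langle g, v^\star \rangle = -\|g_S\| .
\end{align*}
```

Read this way, the projected gradient $g_S$ is, up to sign and scale, the steepest safety-descent direction among all directions that leave the reference objectives unchanged to first order.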