arxiv: 2602.07892 · v2 · submitted 2026-02-08 · 💻 cs.LG · cs.CL

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Pith reviewed 2026-05-16 06:36 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords safety alignment · alignment tax · continual learning · orthogonal gradient projection · LLM post-training · SFT · DPO · gradient interference

The pith

Projecting safety gradients onto the orthogonal complement of a low-rank general-capability subspace reduces the alignment tax during sequential LLM post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats safety post-training as a continual-learning problem in which safety gradients interfere with directions that support earlier general capabilities. It introduces OGPSA, which estimates a low-rank reference subspace from a small set of general-capability gradients and subtracts the component of each safety gradient that lies inside it. The resulting update is the steepest safety-descent direction that leaves the reference objectives unchanged to first order. This approach is shown to improve the safety-utility trade-off on both SFT and DPO stages without replay buffers, with measured gains under sequential pipelines on two 7-8B models.

Core claim

Safety gradients can be made less harmful to general capabilities by removing their projection onto a low-rank subspace spanned by gradients computed on a small general-capability reference set. The resulting update rule, called OGPSA, yields the locally optimal safety improvement subject to first-order preservation of the reference objectives and is compatible with existing SFT and DPO pipelines.
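
The claim can be written out as a short worked statement. The notation below is an illustrative assumption rather than the paper's own: U denotes a matrix with orthonormal columns spanning the rank-k reference subspace, and η is a step size.

```latex
% Illustrative notation (assumed): U has orthonormal columns spanning the rank-k reference subspace.
\begin{aligned}
v^\star &= \arg\min_{\|v\| = 1} \langle g_{\text{safe}}, v \rangle
          \quad \text{s.t.} \quad U^\top v = 0, \\
\tilde g_{\text{safe}} &= (I - U U^\top)\, g_{\text{safe}},
          \qquad v^\star = -\,\tilde g_{\text{safe}} / \|\tilde g_{\text{safe}}\|
          \quad (\tilde g_{\text{safe}} \neq 0), \\
L_{\text{ref}}(\theta - \eta\, \tilde g_{\text{safe}})
        &= L_{\text{ref}}(\theta) - \eta\, \langle \nabla L_{\text{ref}}(\theta), \tilde g_{\text{safe}} \rangle + O(\eta^2)
         = L_{\text{ref}}(\theta) + O(\eta^2)
          \quad \text{when } \nabla L_{\text{ref}}(\theta) \in \operatorname{span}(U).
\end{aligned}
```

The last line is where the low-rank premise enters: the first-order term vanishes only for reference gradients that lie inside the retained subspace.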

What carries the argument

Orthogonal Gradient Projection for Safety Alignment (OGPSA): a lightweight update that subtracts from each safety gradient its component in the low-rank subspace spanned by reference gradients from general-capability data.
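
A minimal NumPy sketch of the projection step described above, assuming flattened gradient vectors and an SVD-based rank-k basis. The function names `reference_basis` and `project_safety_gradient`, the rank `k`, and the toy dimensions are hypothetical and are not taken from the paper's released code.

```python
import numpy as np

def reference_basis(ref_grads: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the top-k directions of the reference gradients.

    ref_grads has shape (m, d): one flattened reference gradient per row.
    """
    # Rows of vt are orthonormal right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:k].T

def project_safety_gradient(g_safe: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from g_safe its component inside span(basis)."""
    return g_safe - basis @ (basis.T @ g_safe)

# Toy data (assumption): 4 reference gradients and 1 safety gradient in 16 dimensions.
rng = np.random.default_rng(0)
ref_grads = rng.normal(size=(4, 16))
g_safe = rng.normal(size=16)

U = reference_basis(ref_grads, k=4)
g_proj = project_safety_gradient(g_safe, U)
```

With `k` equal to the number of independent reference gradients, `g_proj` is orthogonal to every one of them. With a smaller `k`, only the top-k directions are protected, which is the low-rank trade-off the premise below depends on.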

If this is right

  • Under sequential SFT followed by DPO, average performance gain rises from 33.98% to 42.74% on Qwen2.5-7B-Instruct.
  • The same pipeline yields an increase from 19.74% to 32.98% on Llama3.1-8B-Instruct.
  • The method works with both SFT and DPO without requiring large-scale replay of prior data.
  • Periodic reference-gradient computation replaces full replay while still constraining the safety update.
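
The periodic refresh mentioned in the last bullet can be sketched as a toy training loop. This is an assumption-laden illustration, not the released implementation: the quadratic losses, the function names, and the learning rate are hypothetical, and `refresh_every=30` simply mirrors the K = 30 value quoted for SFT in the extracted training details further down the page.

```python
import numpy as np

def topk_basis(grads: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal (d, k) basis of the top-k right singular directions of grads (m, d)."""
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:k].T

def train_with_periodic_refresh(theta, safety_grad_fn, reference_grads_fn,
                                steps=60, refresh_every=30, rank=2, lr=0.1):
    """Projected gradient descent; the reference basis is recomputed every `refresh_every` steps."""
    basis = None
    for step in range(steps):
        if step % refresh_every == 0:
            basis = topk_basis(reference_grads_fn(theta), rank)
        g = safety_grad_fn(theta)
        g = g - basis @ (basis.T @ g)   # drop the component inside the reference subspace
        theta = theta - lr * g
    return theta

# Toy objectives (assumption): a quadratic safety loss and two quadratic reference losses.
rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(2, d))
c = rng.normal(size=2)
target = rng.normal(size=d)

def safety_loss(theta):
    return 0.5 * np.sum((theta - target) ** 2)

def safety_grad(theta):
    return theta - target

def reference_loss(theta):
    return 0.5 * np.sum((W @ theta - c) ** 2)

def reference_grads(theta):
    # One gradient per reference loss, stacked as rows: shape (2, d).
    return (W @ theta - c)[:, None] * W

theta0 = np.zeros(d)
theta1 = train_with_periodic_refresh(theta0, safety_grad, reference_grads)
```

In this toy setting the reference gradients always point along the rows of `W`, so the reference loss is preserved exactly while the safety loss decreases. For a real model the preservation holds only to first order and only between refreshes.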

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection idea could be applied to other sequential fine-tuning regimes where one objective risks erasing earlier skills.
  • If the low-rank estimate remains stable across model scales, the technique may reduce the data volume needed for capability preservation.
  • Adaptive choice of subspace rank or online re-estimation could further tighten the preservation constraint.

Load-bearing premise

A low-rank subspace estimated from a small general-capability dataset accurately identifies the directions whose first-order preservation is enough to prevent capability regression.

What would settle it

Run standard SFT or DPO versus OGPSA on the same model and data; if the final safety and utility scores are statistically indistinguishable or worse under OGPSA, the projection step provides no net benefit.

Figures

Figures reproduced from arXiv: 2602.07892 by Guanglong Sun, Hang Su, Jun Zhu, Liyuan Wang, Siyuan Zhang, Yi Zhong.

Figure 1: Conceptual framework for reframing LLM Safety Alignment as a Constrained Continual Learning Problem. (A) Comparison of traditional CL and LLM Heterogeneous CL. (B) Safety alignment under anti-forgetting constraints.
Figure 2: Overall performance of alignment strategies on Qwen2.5-7B-Instruct. We report the aggregate Safety Score (avg. of 4 datasets) and General Capacity Score (avg. of 6 datasets).
Figure 3: Schematic illustration of the proposed Orthogonal Gradient Projection for Safety Alignment (OGPSA) framework. g_ref1, g_ref2: reference gradients computed from representative general capability datasets (e.g., helpfulness, truthfulness). g_safe: the standard gradient derived from the safety alignment objective. g̃_safe: the projected safety gradient obtained by projecting g_safe onto the orthogonal space of …
Figure 4: Overall performance of alignment strategies on Llama3.1-8B-Instruct. We report the aggregate Safety Score (avg. of 4 datasets) and General Capacity Score (avg. of 6 datasets).
Abstract

Safety post-training can reduce the harmfulness and improve the policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. We have open-sourced our code at https://github.com/SunGL001/OGPSA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that safety alignment can be viewed as a continual learning problem where safety gradients interfere with directions supporting general capabilities, leading to an alignment tax. It proposes OGPSA, which estimates a low-rank subspace from gradients on a small general-capability dataset and projects each safety gradient to be orthogonal to this subspace. This yields the steepest local safety descent subject to first-order preservation of reference objectives. The method is applied to SFT, DPO, and sequential SFT→DPO pipelines, reporting improved safety-utility trade-offs, including average performance gains rising from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct, with open-sourced code.

Significance. If the empirical results hold, OGPSA provides a lightweight, replay-free mechanism for mitigating one source of capability regression during sequential alignment stages. The first-order projection is transparent, the gains are demonstrated across two model families and three regimes, and the open-sourced implementation allows direct verification. This could be a practical addition to post-training pipelines when gradient interference between safety and utility objectives is a concern.

major comments (2)
  1. [§3] §3 (OGPSA definition): the central claim that removing the projected component avoids capability regression while preserving safety descent rests on the low-rank subspace from the small general dataset accurately capturing the relevant directions; the manuscript reports gains only for the chosen dataset and rank without ablations on dataset size, diversity, or rank sensitivity, so it is unclear whether the improvement generalizes when the subspace is less representative.
  2. [§4] §4 (experimental results): the reported average gains under SFT→DPO are load-bearing for the main claim, yet no analysis is given of how the projection affects the safety loss surface or convergence when safety and capability gradients overlap substantially; this leaves the weakest assumption untested.
minor comments (1)
  1. [Abstract and §3] The abstract and §3 mention periodic reference-gradient computation but do not quantify its frequency or overhead relative to standard SFT/DPO steps; adding a short complexity table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's positive evaluation and recommendation for minor revision. We address the two major comments below, agreeing that additional analyses will improve the manuscript, and will incorporate them accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (OGPSA definition): the central claim that removing the projected component avoids capability regression while preserving safety descent rests on the low-rank subspace from the small general dataset accurately capturing the relevant directions; the manuscript reports gains only for the chosen dataset and rank without ablations on dataset size, diversity, or rank sensitivity, so it is unclear whether the improvement generalizes when the subspace is less representative.

    Authors: We thank the referee for highlighting this important point. The low-rank subspace is indeed central to the method's effectiveness. While our experiments use a standard general-capability dataset and a fixed rank chosen based on preliminary validation, we agree that explicit ablations would better support generalizability. In the revised version, we will add experiments varying dataset size (e.g., 50, 100, 500 samples), diversity (comparing to other benchmarks like MMLU subsets), and rank (k=2 to 32). These will be included in a new subsection in §3 or §4, with results showing that performance remains stable for reasonable choices of these hyperparameters. revision: yes

  2. Referee: [§4] §4 (experimental results): the reported average gains under SFT→DPO are load-bearing for the main claim, yet no analysis is given of how the projection affects the safety loss surface or convergence when safety and capability gradients overlap substantially; this leaves the weakest assumption untested.

    Authors: We agree that analyzing the effect on the loss surface and convergence under high overlap would strengthen the paper. In the revision, we will add a discussion and empirical analysis: specifically, we will compute the cosine similarity between safety and capability gradients before and after projection, and plot safety loss curves for cases with varying overlap. This will be presented in §4 to illustrate how OGPSA mitigates interference without stalling convergence. revision: yes
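
The diagnostic proposed in this response (cosine similarity between safety and capability gradients before and after projection) can be sketched on toy vectors. Everything below is a hypothetical illustration: the vectors are synthetic, the overlap coefficient 0.6 is arbitrary, and a rank-1 reference subspace is assumed for simplicity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy gradients (assumption): the safety gradient is built to overlap with the capability gradient.
rng = np.random.default_rng(1)
g_cap = rng.normal(size=32)
g_safe = 0.6 * g_cap + rng.normal(size=32)

u = g_cap / np.linalg.norm(g_cap)      # rank-1 reference basis
g_proj = g_safe - u * (u @ g_safe)     # projected safety gradient

cos_before = cosine(g_safe, g_cap)
cos_after = cosine(g_proj, g_cap)
# Fraction of the safety gradient's norm that survives projection.
retained = float(np.linalg.norm(g_proj) / np.linalg.norm(g_safe))
```

For a rank-1 subspace, `retained**2 + cos_before**2` equals 1, so the surviving step length shrinks as the overlap grows. This is one concrete way to quantify the referee's concern about stalled convergence when the two gradients overlap substantially.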

Circularity Check

0 steps flagged

No significant circularity in OGPSA derivation chain

full rationale

The central mechanism defines the update as the safety gradient with its component in the low-rank general-capability subspace removed. This is an explicit first-order projection constructed directly from separate gradient computations on two data sources; the final safety-utility metrics are not used to define or fit the subspace or projection operator. No step renames a fitted quantity as a prediction, imports uniqueness via self-citation, or smuggles an ansatz through prior work by the same authors. The reported gains (e.g., 33.98% to 42.74%) are presented as empirical outcomes of applying the rule, not as quantities that reduce to the rule's inputs by algebraic identity. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on the assumption that a low-rank linear subspace estimated from a modest general-capability dataset is sufficient to protect first-order capability directions; no new entities are postulated and the only free parameter is the subspace rank.

free parameters (1)
  • subspace rank
    Dimensionality of the reference subspace estimated from general-capability gradients; chosen as a hyperparameter and not derived from data.
axioms (2)
  • domain assumption: The directions most relevant to general capabilities are spanned by gradients computed on a small held-out general-capability dataset.
    Invoked when constructing the reference subspace for projection.
  • domain assumption: First-order orthogonality is sufficient to prevent capability regression during subsequent safety updates.
    Core premise of the orthogonal projection step.
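
Since the subspace rank is the single free parameter in this ledger, one way to inspect it is to look at how much of the reference gradients' squared singular-value mass the top-k directions capture. The sketch below is an editorial illustration under stated assumptions: the 0.99 threshold, the function name `energy_captured`, and the synthetic gradients are hypothetical, and the source states that the paper's rank was chosen as a hyperparameter rather than derived this way.

```python
import numpy as np

def energy_captured(ref_grads: np.ndarray) -> np.ndarray:
    """Cumulative fraction of squared singular values captured by the top-k directions, k = 1..m."""
    s = np.linalg.svd(ref_grads, compute_uv=False)
    return np.cumsum(s ** 2) / np.sum(s ** 2)

# Toy reference gradients (assumption): 3 dominant shared directions plus small noise.
rng = np.random.default_rng(2)
directions = rng.normal(size=(3, 64))
coeffs = rng.normal(size=(20, 3))
ref_grads = coeffs @ directions + 0.01 * rng.normal(size=(20, 64))

energy = energy_captured(ref_grads)
# Smallest k whose top-k directions capture at least 99% of the squared singular-value mass.
smallest_rank = int(np.searchsorted(energy, 0.99) + 1)
```

A spectrum that saturates quickly supports the low-rank assumption for that reference set; a slowly rising one suggests that a small rank leaves capability-relevant directions unprotected.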

pith-pipeline@v0.9.0 · 5643 in / 1401 out tokens · 29864 ms · 2026-05-16T06:36:44.897199+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  2. Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

    cs.CR 2026-04 unverdicted novelty 6.0

    Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.

  3. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 5.0

    ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 2 Pith papers · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    A General Language Assistant as a Laboratory for Alignment

    Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861,

  3. [3]

    Qwen Technical Report

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

  4. [4]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  5. [5]

    Safety-aware fine-tuning of large language models

    Choi, H. K., Du, X., and Li, Y. Safety-aware fine-tuning of large language models. In NeurIPS Safe Generative AI Workshop 2024,

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  7. [7]

    How robust is Google’s Bard to adversarial image attacks? arXiv:2309.11751, 2023

    Dong, Y., Chen, H., Chen, J., Fang, Z., Yang, X., Zhang, Y., Tian, Y., Su, H., and Zhu, J. How robust is Google’s Bard to adversarial image attacks? arXiv preprint arXiv:2309.11751,

  8. [8]

    The Llama 3 Herd of Models

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  9. [9]

    Safeguard fine-tuned LLMs through pre- and post-tuning model merging

    Farn, H., Su, H., Kumar, S. H., Sahay, S., Chen, S.-T., and Lee, H.-y. Safeguard fine-tuned LLMs through pre- and post-tuning model merging. arXiv preprint arXiv:2412.19512,

  10. [10]

    Deliberative alignment: Reasoning enables safer language models

    Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Heylar, A., Dias, R., Vallone, A., Ren, H., Wei, J., et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339,

  11. [11]

    Continual learning for text classification with information disentanglement based regularization

    Huang, Y., Zhang, Y., Chen, J., Wang, X., and Yang, D. Continual learning for text classification with information disentanglement based regularization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2736–2746,

  12. [12]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674,

  13. [13]

    OpenAI o1 System Card

    Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720,

  14. [14]

    Pku-saferlhf: Towards multi-level safety alignment for llms with human preference

    Ji, J., Hong, D., Zhang, B., Chen, B., Dai, J., Zheng, B., Qiu, T., Li, B., and Yang, Y. PKU-SafeRLHF: A safety alignment preference dataset for Llama family models. arXiv preprint arXiv:2406.15513,

  15. [15]

    OpenCQA: Open-ended question answering with charts

    Kantharaj, S., Do, X. L., Leong, R. T., Tan, J. Q., Hoque, E., and Joty, S. OpenCQA: Open-ended question answering with charts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11817–11837,

  16. [16]

    HaluEval: A large-scale hallucination evaluation benchmark for large language models

    Li, J., Cheng, X., Zhao, X., Nie, J.-Y., and Wen, J.-R. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing. Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947,

  17. [17]

    Mitigating the alignment tax of RLHF

    Lin, Y., Lin, H., Xiong, W., Diao, S., Liu, J., Zhang, J., Pan, R., Wang, H., Hu, W., Zhang, H., Dong, H., Pi, R., Zhao, H., Jiang, N., Ji, H., Yao, Y., and Zhang, T. Mitigating the alignment tax of RLHF. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 580–606,

  18. [18]

    Scientific algorithm discovery by augmenting AlphaEvolve with deep research

    Liu, G., Zhu, Y., Chen, J., and Jiang, M. Scientific algorithm discovery by augmenting AlphaEvolve with deep research. arXiv preprint arXiv:2510.06056,

  19. [19]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., and Liu, Y. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860,

  20. [20]

    XSTest: A test suite for identifying exaggerated safety behaviours in large language models

    Röttger, P., Kirk, H., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5377–5400,

  21. [21]

    A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024a

    Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024a. Wang, Y., Li, H., Han, X., Nakov, P., and Bal...

  22. [22]

    Reinforcement Learning for LLM Post-Training: A Survey

    Wang, Z., Bi, B., Pentyala, S. K., Ramnath, K., Chaudhuri, S., Mehrotra, S., Mao, X.-B., Asur, S., et al. A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more. arXiv preprint arXiv:2407.16216, 2024c. Wang, Z., Yang, F., Wang, L., Zhao, P., Wang, H., Chen, L., Lin, Q., and Wong, K.-F. SELF-GUARD: Empower the LLM to safeguard ...

  23. [23]

    Measuring short-form factuality in large language models

    URL https://aclanthology.org/2024.naacl-long.92/. Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J., and Fedus, W. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368,

  24. [24]

    Qwen2.5 Technical Report

    Wu, L., Wang, M., Xu, Z., Cao, T., Oo, N., Hooi, B., and Deng, S. Automating steering for safe multimodal large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 792–814, 2025a. Wu, Y., Piao, H., Huang, L.-K., Wang, R., Li, W., Pfister, H., Meng, D., Ma, K., and Wei, Y. SD-LoRA: Scalable decou...

  25. [25]

    SLCA: Slow learner with classifier alignment for continual learning on a pre-trained model. arXiv preprint arXiv:2303.05118,

    Zhang, G., Wang, L., Kang, G., Chen, L., and Wei, Y. SLCA: Slow learner with classifier alignment for continual learning on a pre-trained model. arXiv preprint arXiv:2303.05118,

  26. [26]

    Instruction-Following Evaluation for Large Language Models

    Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023a. Zhou, D.-W., Sun, H.-L., Ning, J., Ye, H.-J., and Zhan, D.-C. Continual learning with pre-trained models: a survey. In Proceedings of the Thirty-Third I...

  27. [27]

    and Qwen2.5-7B-Instruct (Yang et al., 2024b). For the safety alignment phase, we utilize a seed dataset D_safe consisting of 10k samples drawn from PKU-SafeRLHF (Ji et al., 2024), where the SFT labels and DPO chosen labels are refusal responses generated by gpt-4omini, and the DPO rejected labels are selected from the most unsafe answers in the original dataset...

  28. [28]

    The SFT labels and DPO chosen labels are selected from the highest-scoring answers in the original dataset, and the DPO rejected labels are selected from the lowest-scoring answers

    as the general data source, where the SFT labels and DPO chosen labels are selected from the highest-scoring answers in the original dataset, and the DPO rejected labels are selected from the lowest-scoring answers. For our proposed method (OGPSA), we require a small reference set to estimate the general capability subspace. To efficiently capture the essential dimensi...

  29. [29]

    Training Details. We trained all LLMs with LLaMA-Factory (Zheng et al., 2024), which is a popular toolbox for LLM training

    into the safety training data. Training Details. We trained all LLMs with LLaMA-Factory (Zheng et al., 2024), which is a popular toolbox for LLM training. Consistent with established protocols (Zhang et al., 2025), all models are trained for 3 epochs during the SFT stage and 1 epoch during the DPO stage. We tune the learning rate 1e−6 and ...

  30. [30]

    Following the official implementation, we set learning rate 1e−4 for LoRA

    We adopt a cosine scheduler with a warm-up ratio of 0.1. Following the official implementation, we set the learning rate to 1e−4 for LoRA. For the subspace update frequency K, we set K = 30 for all SFT experiments and K = 5 for DPO experiments. Evaluation. We employ a comprehensive suite of 10 benchmarks to evaluate the trade-off between safety and general utility. Safety (Harmle...

  31. [31]

    Additionally, we evaluate adversarial robustness using AdvGLUE (Wang et al., 2021)

    and PAP (Zeng et al., 2024)), while reporting refusal rates for other datasets. General Utility: We assess diverse capabilities including truthfulness via SimpleQA (Wei et al., 2024), GPQA (Rein et al., 2024), and MMLU (Hendrycks et al., 2021a), and general helpfulness via BIG-bench HHH (Zhou et al., 2024b) and instruction following via IFEval (Zhou et al.,...

  32. [32]

    We evaluate models on prompts with no jailbreak in addition to the reported top-2 jailbreak methods PAIR (Chao et al., 2025), and PAP-Misrepresentation (Zeng et al., 2024)

    and take the goodness score, which is 1 − rubric score, as the metric. We evaluate models on prompts with no jailbreak in addition to the reported top-2 jailbreak methods PAIR (Chao et al., 2025), and PAP-Misrepresentation (Zeng et al., 2024). For main results, we only report the average goodness score on the two jailbreak methods, since most methods achi...

  33. [33]

    By the Cauchy-Schwarz (Björck, 1994; Leon et al.,

    Substituting the decomposition:
    $$\langle g, v\rangle = \langle g_S + g_\perp, v\rangle = \langle g_S, v\rangle + \underbrace{\langle g_\perp, v\rangle}_{0} = \langle g_S, v\rangle. \tag{20}$$
    Step 4: Minimization via Cauchy-Schwarz. The problem reduces to minimizing the inner product $\langle g_S, v\rangle$ subject to unit norm. By the Cauchy-Schwarz (Björck, 1994; Leon et al.,

  34. [34]

    inequality:
    $$|\langle g_S, v\rangle| \le \|g_S\|\,\|v\| = \|g_S\|. \tag{21}$$
    This implies:
    $$-\|g_S\| \le \langle g_S, v\rangle \le \|g_S\|. \tag{22}$$
    The lower bound (maximum descent) is achieved if and only if $v$ is collinear to $g_S$ and points in the opposite ...
    Footnote 1: https://platform.openai.com/docs/guides/moderation

  35. [35]

    Our method remains effective even with limited data budgets

    on Qwen. We evaluate the performance stability as the number of samples used to estimate the reference capability gradients increases. Our method remains effective even with limited data budgets. The best results are marked in bold. Model Safety (↑) Truthful (↑) Helpful (↑) Stereotype StrongReject SimpleQA MMLU IFEval HHH Qwen2.5-7B-Instruct Model 96.74 44.83 ...