arxiv: 2604.17691 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AI

Recognition: unknown

SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

Dongxin Guo , Jikun Wu , Siu Ming Yiu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 05:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords safety alignmentcontinual domain adaptationlarge language modelsLoRAFisher informationgradient constraintscumulative erosion

0 comments

The pith

SafeAnchor prevents cumulative safety erosion in LLMs by identifying and protecting low-rank safety subspaces during sequential domain adaptations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Safety alignment in large language models erodes when the same model is fine-tuned sequentially on new domains such as medicine, law, and code, because current methods only protect against single-task changes. The paper demonstrates that this erosion can be prevented by locating the parameter directions most responsible for safety using Fisher Information analysis and then forcing all new learning updates to avoid those directions. The resulting framework, SafeAnchor, also adds a monitoring step that replays safety examples when drift is detected. If the method succeeds, models can accumulate specialized capabilities across many domains without progressively losing their original refusal of harmful requests.

Core claim

Safety alignment resides in low-rank subspaces of the LoRA-adapted parameter space that remain stable enough to be identified once via Fisher Information eigendecomposition. By projecting domain-adaptation gradients onto the orthogonal complement of these subspaces and triggering corrective replay when safety thresholds are crossed, SafeAnchor keeps 93.2 percent of the original safety score while matching unconstrained fine-tuning performance on the new domain tasks within 1.5 points.

What carries the argument

Low-rank safety subspaces located by Fisher Information eigendecomposition in LoRA parameter space, with gradient updates during adaptation constrained to the orthogonal complement of those subspaces.

If this is right

Sequential adaptation pipelines can now be run on multiple specialized domains without progressive loss of refusal behavior.
Safety guardrails can be maintained while the model acquires new factual or procedural knowledge in each domain.
Monitoring with threshold-triggered replay becomes sufficient to handle any small residual drift that orthogonal projection misses.
Existing single-domain safety techniques become unnecessary once the subspace constraint is applied at every step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same subspace-avoidance idea could be tested for preserving other fragile properties such as factual consistency or stylistic constraints.
Running the method on four or more domains or on 13B-scale models would show whether the low-rank assumption scales.
If the identified safety subspaces overlap across different base models, a single pre-computed anchor set might serve multiple architectures.
Combining the orthogonal constraint with periodic safety fine-tuning on mixed data could further reduce the need for replay.

Load-bearing premise

That the directions carrying safety alignment stay fixed and low-rank even after the model has been adapted to new domains, so that simply avoiding them does not block useful learning.

What would settle it

Execute the three-domain sequential fine-tuning pipeline on Llama-2-7B-Chat; if safety retention after the final domain falls below 80 percent or domain-task accuracy drops more than 3 points behind plain LoRA, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2604.17691 by Dongxin Guo, Jikun Wu, Siu Ming Yiu.

**Figure 1.** Figure 1: SafeAnchor pipeline. SSI identifies safety-critical LoRA directions via Fisher eigendecomposition. OSCA projects domain gradients orthogonally during training. CSM triggers corrective replay if safety degrades. The subspace is incrementally updated after each domain. orthogonal to the column span of prior-task LoRA matrices, preventing inter-task interference but providing no mechanism for protecting a beh… view at source ↗

**Figure 2.** Figure 2: Safety score trajectory across sequential domain adaptations on Llama-2-7BChat. SafeAnchor prevents the compounding erosion exhibited by all baselines [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility becomes critical in real-world deployment, where models undergo sequential adaptation across domains such as medicine, law, and code, causing safety guardrails to erode cumulatively. Yet all existing safety-preserving methods target only single-task fine-tuning, leaving the multi-domain sequential setting entirely unaddressed. We introduce SafeAnchor, a framework that anchors safety in place throughout continual adaptation. SafeAnchor first identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition, then constrains domain-specific gradient updates to the orthogonal complement of these subspaces, and finally monitors for residual safety drift with threshold-triggered corrective replay. Evaluated on Llama-2-7B-Chat and Mistral-7B-Instruct across a three-domain pipeline and eight benchmarks, SafeAnchor retains 93.2% of original safety alignment, outperforming all baselines by 18-42 points, while matching unconstrained fine-tuning to within 1.5 points on domain tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SafeAnchor projects domain gradients orthogonal to Fisher-identified safety subspaces in LoRA space to limit cumulative erosion, with reported 93% retention on two models, but the subspaces' stability across updates is not shown.

read the letter

The paper's main contribution is a method that finds low-rank safety directions once via Fisher Information on the initial model, then keeps all later domain gradients in the orthogonal complement while adding a simple drift detector that triggers replay when safety drops below a threshold. It targets the sequential multi-domain case that single-task safety work left open, and the abstract numbers on Llama-2-7B-Chat and Mistral-7B-Instruct look usable: 93.2% safety retention, 18-42 points above baselines, and domain performance within 1.5 points of unconstrained fine-tuning across three domains and eight benchmarks.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SafeAnchor, a framework for preserving safety alignment in LLMs during continual domain adaptation. It identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition on the initial model, constrains all subsequent domain-specific gradient updates to the orthogonal complement of these subspaces, and adds threshold-triggered corrective replay to handle residual drift. Evaluated on Llama-2-7B-Chat and Mistral-7B-Instruct across a three-domain pipeline and eight benchmarks, the method is claimed to retain 93.2% of original safety alignment, outperform baselines by 18-42 points, and match unconstrained fine-tuning performance to within 1.5 points on domain tasks.

Significance. If the central results hold under scrutiny, the work is significant because it is the first to target the multi-domain sequential adaptation setting for safety preservation, a gap left by prior single-task methods. The approach builds on standard Fisher Information and orthogonality tools in a targeted way for continual safety, and the concrete numbers on two models plus multiple benchmarks provide a starting point for practical impact in real-world LLM deployment pipelines.

major comments (2)

[Method (Fisher subspace identification and projection step)] The stability of the safety subspaces (identified once via Fisher eigendecomposition on the initial model) is load-bearing for the cumulative-erosion claim, yet no results are reported on subspace invariance after domain updates, principal angles between initial and post-update subspaces, or re-estimation of the subspaces. Without such checks, the 93.2% retention figure cannot be distinguished from a sequence-specific artifact.
[Experiments and results tables] Table reporting the main results (likely Table 2 or 3) gives aggregate safety retention and domain-task scores but omits per-domain breakdowns, statistical significance tests across runs, and ablations on the two free parameters (safety subspace rank and drift threshold). This makes it impossible to verify that the outperformance (18-42 points) is robust rather than tied to the specific three-domain ordering.

minor comments (2)

The abstract states evaluation on 'eight benchmarks' but does not name them or indicate which are safety vs. domain-task metrics; adding this list would improve clarity.
[Method section] Notation for the orthogonal projection operator and the safety loss used in the Fisher computation could be formalized with an equation to avoid ambiguity in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of addressing safety preservation in the multi-domain continual adaptation setting. We address each major comment below and will incorporate revisions to strengthen the empirical support for our claims.

read point-by-point responses

Referee: The stability of the safety subspaces (identified once via Fisher eigendecomposition on the initial model) is load-bearing for the cumulative-erosion claim, yet no results are reported on subspace invariance after domain updates, principal angles between initial and post-update subspaces, or re-estimation of the subspaces. Without such checks, the 93.2% retention figure cannot be distinguished from a sequence-specific artifact.

Authors: We agree that explicit verification of subspace stability would strengthen the cumulative-erosion claims. The orthogonal projection is intended to prevent direct parameter updates within the safety subspace by construction, and the 93.2% retention is measured directly on safety benchmarks after the full sequence. To address potential indirect drift or sequence artifacts, the revised manuscript will add: principal angles between the initial safety subspace and the effective subspaces after each domain adaptation; overlap metrics for the top eigenvectors pre- and post-adaptation; and an ablation re-estimating the Fisher subspaces after each domain. These results will confirm that retention is attributable to the anchoring mechanism rather than the specific domain order. revision: yes
Referee: Table reporting the main results (likely Table 2 or 3) gives aggregate safety retention and domain-task scores but omits per-domain breakdowns, statistical significance tests across runs, and ablations on the two free parameters (safety subspace rank and drift threshold). This makes it impossible to verify that the outperformance (18-42 points) is robust rather than tied to the specific three-domain ordering.

Authors: We concur that granular reporting and ablations are needed to demonstrate robustness. The revised version will expand the results to include: per-domain breakdowns of safety retention and task performance for the medical-law-code pipeline; means and standard deviations over multiple random seeds with statistical significance tests (paired t-tests against baselines); and ablations on safety subspace rank (e.g., 8/16/32) and drift threshold (e.g., 0.05/0.1/0.2). We will also add results for a reversed domain ordering to show that outperformance (18-42 points) and near-parity with unconstrained fine-tuning (within 1.5 points) hold independently of sequence. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's core method identifies safety subspaces once via standard Fisher Information eigendecomposition in LoRA space and applies an orthogonality constraint on subsequent gradients, followed by empirical monitoring. These steps rely on well-established techniques from continual learning (e.g., EWC-style importance weighting) without reducing any claimed result to a fitted quantity or self-referential definition by the paper's own equations. Performance figures such as 93.2% retention are reported from direct evaluation on Llama-2-7B and Mistral-7B across a fixed three-domain sequence rather than derived as predictions from the method itself. No load-bearing self-citations, ansatz smuggling, or uniqueness theorems imported from prior author work appear in the abstract or described pipeline. The derivation remains self-contained against external benchmarks and does not collapse to tautology.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

Review limited to abstract; full text would permit exhaustive enumeration. The central claim rests on the localization of safety to low-rank subspaces and the protective effect of orthogonal gradient constraints, both treated as domain assumptions rather than derived results.

free parameters (2)

safety subspace rank
Determined via eigendecomposition but specific cutoff value not stated in abstract and likely tuned to data.
drift detection threshold
Trigger level for corrective replay; value not provided in abstract.

axioms (2)

domain assumption Safety alignment resides in identifiable low-rank subspaces of LoRA parameter space
Invoked in the Fisher Information eigendecomposition step.
domain assumption Gradient updates restricted to the orthogonal complement of safety subspaces do not erode alignment
Core mechanism of the constraint step.

invented entities (1)

safety subspaces in LoRA parameter space no independent evidence
purpose: To localize and protect safety-critical directions during adaptation
Postulated via Fisher Information analysis; no independent evidence outside the method itself is mentioned.

pith-pipeline@v0.9.0 · 5503 in / 1501 out tokens · 61638 ms · 2026-05-10T05:19:46.509580+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 14 canonical work pages · 7 internal anchors

[1]

Continual Learning of Large Language Models: A Comprehensive Survey

H. Shi, Z. Xu, H. Wang, W. Qin, W. Wang, Y. Wang, Z. Wang, S. Ebrahimi, and H. Wang. “Continual Learning of Large Language Models: A Comprehensive Survey”. In:ACM Comput. Surv.58.5 (2026), 120:1–120:42

2026
[2]

An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025

Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang. “An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning”. In:arXiv preprintarXiv.2308.08747 (2023)

work page arXiv 2023
[3]

Safety Alignment Should be Made More Than Just a Few Tokens Deep

X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson. “Safety Alignment Should be Made More Than Just a Few Tokens Deep”. In:The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[4]

Language Models Resist Alignment: Evidence From Data Compression

J. Ji, K. Wang, T. A. Qiu, B. Chen, J. Zhou, C. Li, H. Lou, J. Dai, Y. Liu, and Y. Yang. “Language Models Resist Alignment: Evidence From Data Compression”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025. Ed. by W. Che, J. Nabende, E. S...

2025
[5]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

X. Qi, Y. Zeng, T. Xie, P. Chen, R. Jia, P. Mittal, and P. Henderson. “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” In:The Twelfth International Con- ference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[6]

X. Yang, X. Wang, Q. Zhang, L. Petzold, W. Y. Wang, X. Zhao, and D. Lin.Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models. 2023

2023
[7]

Safety tax: Safety alignment makes your large reasoning models less reasonable

T. Huang, S. Hu, F. Ilhan, S. F. Tekin, Z. Yahn, Y. Xu, and L. Liu. “Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable”. In:arXiv preprintarXiv.2503.00555 (2025)

work page arXiv 2025
[8]

Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack

T. Huang, S. Hu, and L. Liu. “Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack”. In:Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. Ed. by A. Globersons, L. Mackey, D. Belgrav...

2024
[9]

Representation Noising: A Defence Mechanism Against Harmful Finetuning

D. Rosati, J. Wehner, K. Williams, L. Bartoszcze, R. Gonzales, C. Maple, S. Majumdar, H. Sajjad, and F. Rudzicz. “Representation Noising: A Defence Mechanism Against Harmful Finetuning”. In: Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 1...

2024
[10]

Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models

C. Hsu, Y. Tsai, C. Lin, P. Chen, C. Yu, and C. Huang. “Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models”. In:arXiv preprintarXiv.2405.16833 (2024)

work page arXiv 2024
[11]

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

M. Li, W. M. Si, M. Backes, Y. Zhang, and Y. Wang. “SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation”. In:The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[12]

Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack

T. Huang, S. Hu, F. Ilhan, S. F. Tekin, and L. Liu. “Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack”. In:Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. Ed. by A. Globersons, L. Macke...

2024
[13]

B. Yi, J. Li, B. Zhang, L. Nie, T. Li, T. Huang, and Z. Liu.Gradient Surgery for Safe LLM Fine-Tuning. 2025

2025
[14]

Orthogonal Subspace Learning for Language Model Continual Learning

X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang. “Orthogonal Subspace Learning for Language Model Continual Learning”. In:Findings of the Association for Com- putational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023, 10658–10671

2023
[15]

InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning

Y. Liang and W. Li. “InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. IEEE, 2024, pp. 23638–23647

2024
[16]

Liang, J.-R

Y.-S. Liang, J.-R. Chen, and W.-J. Li.Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models. 2025

2025
[17]

Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

L. Alssum, H. Itani, H. A. A. K. Hammoud, P. Torr, A. Bibi, and B. Ghanem. “Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning”. In:arXiv preprint arXiv.2512.10150 (2025)

work page arXiv 2025
[18]

G. Sun, S. Zhang, L. Wang, J. Zhu, H. Su, and Y. Zhong.Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection. 2026

2026
[19]

Refusal in Language Models Is Mediated by a Single Direction

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. “Refusal in Language Models Is Mediated by a Single Direction”. In:Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. Ed. by A. Globersons, L. Mac...

2024
[20]

The hidden di- mensions of llm alignment: A multi-dimensional anal- ysis of orthogonal safety directions.arXiv preprint arXiv:2502.09674, 2025

W. Pan, Z. Liu, Q. Chen, X. Zhou, H. Yu, and X. Jia. “The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis”. In:arXiv preprintarXiv.2502.09674 (2025)

work page arXiv 2025
[21]

Representation Engineering: A Top-Down Approach to AI Transparency

A. Zou et al. “Representation Engineering: A Top-Down Approach to AI Transparency”. In:arXiv preprintarXiv.2310.01405 (2023)

work page internal anchor Pith review arXiv 2023
[22]

Orthogonal Gradient Descent for Continual Learning

M. Farajtabar, N. Azizan, A. Mott, and A. Li. “Orthogonal Gradient Descent for Continual Learning”. In:The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy]. Ed. by S. Chiappa and R. Calandra. Vol. 108. Proceedings of Machine Learning Research. PMLR, 2020, pp. 3762–3773

2020
[23]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa. “Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations”. In: arXiv preprintarXiv.2312.06674 (2023)

work page internal anchor Pith review arXiv 2023
[24]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models”. In:arXiv preprint arXiv.2307.09288 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Mistral 7B

A. Q. Jiang et al. “Mistral 7B”. In:arXiv preprintarXiv.2310.06825 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams, 2020

D. Jin, E. Pan, N. Oufattole, W. Weng, H. Fang, and P. Szolovits. “What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams”. In:arXiv preprintarXiv.2009.13081 (2020)

work page arXiv 2009
[27]

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

N. Guha et al. “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models”. In:Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

2023
[28]

Evaluating Large Language Models Trained on Code

M. Chen et al. “Evaluating Large Language Models Trained on Code”. In:arXiv preprint arXiv.2107.03374 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[29]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. A. Forsyth, and D. Hendrycks. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal”. In:Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Ed. by R. Salakhutdinov, Z. Kolt...

2024
[30]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

S. Lin, J. Hilton, and O. Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. Ed. by S. Muresan, P. Nakov, and A. Villavicencio. Association for Computational Linguistics, 2022, pp. ...

2022
[31]

BBQ: A hand-built bias benchmark for question answering

A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman. “BBQ: A hand-built bias benchmark for question answering”. In:Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022. Ed. by S. Muresan, P. Nakov, and A. Villavicencio. Vol. ACL 2022. Findings of ACL. Associati...

2022
[32]

WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Y. Choi, N. Dziri, A. Ettinger, S. Han, L. Jiang, N. Lambert, B. Y. Lin, and K. Rao. “WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs”. In:Advances in Neural Information Processing Systems 37. NeurIPS 2024. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024, 8093–8131

2024
[33]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. “Measuring Massive Multitask Language Understanding”. In:9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

2021
[34]

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

T. Huang, S. Hu, F. Ilhan, S. F. Tekin, and L. Liu. “Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey”. In:arXiv preprintarXiv.2409.18169 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

LoRA: Low- Rank Adaptation of Large Language Models

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. “LoRA: Low- Rank Adaptation of Large Language Models”. In:The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022

2022
[36]

Overcoming catastrophic forgetting in neural networks

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. “Overcoming catastrophic forgetting in neural networks”. In:Proceedings of the National Academy of Sciences114.13 (2017), 3521–3526.issn: 1091-6490

2017
[37]

Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li. “Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch”. In:Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Ed. by R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp. Vol. 235....

2024
[38]

Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

D. Wang, Q. Ma, Y. Shang, Z. Lu, L. Ning, Z. Xu, H. Wu, and Z. He. “Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation”. In:arXiv preprintarXiv.2512.23260 (2025)

work page arXiv 2025
[39]

Safe RLHF: Safe Rein- forcement Learning from Human Feedback

J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. “Safe RLHF: Safe Rein- forcement Learning from Human Feedback”. In:The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[40]

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y. Wang, and Y. Yang. “BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset”. In:Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023...

2023
[41]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”. In:Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decemb...

2023
[42]

Ed. by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine. 2023

2023
[43]

PEFT: State-of- the-art Parameter-Efficient Fine-Tuning Methods

S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, and M. Tietz. “PEFT: State-of- the-art Parameter-Efficient Fine-Tuning Methods”. In:GitHub repository(2022)

2022
[44]

Universal and Transferable Adversarial Attacks on Aligned Language Models

A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson. “Universal and Transferable Adversarial Attacks on Aligned Language Models”. In:arXiv preprintarXiv.2307.15043 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Tamper-Resistant Safeguards for Open-Weight LLMs

R. Tamirisa, B. Bharathi, L. Phan, A. Zhou, A. Gatti, T. Suresh, M. Lin, J. Wang, R. Wang, R. Arel, A. Zou, D. Song, B. Li, D. Hendrycks, and M. Mazeika. “Tamper-Resistant Safeguards for Open-Weight LLMs”. In:The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[46]

DoRA: Weight- Decomposed Low-Rank Adaptation

S. Liu, C. Wang, H. Yin, P. Molchanov, Y. F. Wang, K. Cheng, and M. Chen. “DoRA: Weight- Decomposed Low-Rank Adaptation”. In:Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Ed. by R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp. Vol. 235. Proceedings of...

2024