pith. machine review for the scientific record.

arxiv: 2605.01913 · v1 · submitted 2026-05-03 · 💻 cs.LG · cs.AI · cs.CE · cs.CL · cs.CR

Recognition: unknown

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CE · cs.CL · cs.CR
keywords LLM safety · fine-tuning · representation drift · refusal behavior · adversarial robustness · geometric preservation · hidden activations

The pith

RefusalGuard constrains updates in hidden representation space to keep safety features stable during fine-tuning of LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard fine-tuning causes safety-relevant representations to drift and their geometry to distort, which increases a model's willingness to comply with harmful requests. RefusalGuard counters this by limiting how much those specific representations can change while still allowing the model to learn new tasks in other directions. This keeps refusal behavior close to the original safety-aligned model without sacrificing much performance on utility benchmarks. A sympathetic reader would care because it offers a way to adapt models to specific uses while reducing the risk that they become easier to misuse.

Core claim

RefusalGuard is a representation-level fine-tuning framework that preserves the geometric structure of safety-relevant representations by constraining updates in hidden activation space. This prevents systematic drift and interference between task optimization and safety features, resulting in attack success rates on adversarial benchmarks that match those of the base safety-aligned models while maintaining competitive performance on downstream tasks across LLaMA, Gemma, and Qwen families.
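
The abstract pins the mechanism only at the level of "constrains updates in hidden representation space." One plausible formalization, offered as a reading rather than as the paper's actual objective, is a task loss plus a drift penalty restricted to an identified safety subspace at a chosen set of layers; the coefficient λ_geom, layer set S, and bases B_ref below are illustrative notation, not quoted from the paper.

    % Illustrative formalization only; the paper's constraint may take a different form.
    \mathcal{L}(\theta)
      = \mathcal{L}_{\mathrm{task}}(\theta)
      + \lambda_{\mathrm{geom}} \sum_{\ell \in S}
        \mathbb{E}_{x \sim \mathcal{D}}
        \left\| B^{(\ell)}_{\mathrm{ref}} {B^{(\ell)}_{\mathrm{ref}}}^{\top}
        \bigl( h_\ell(x;\theta) - h_\ell(x;\theta_0) \bigr) \right\|_2^2
    % h_\ell(x;\theta): hidden activation at layer \ell,
    % \theta_0: frozen weights of the aligned base model,
    % B^{(\ell)}_{\mathrm{ref}}: orthonormal basis of the safety-relevant subspace at layer \ell.

Under this reading, a large λ_geom pins the projection of each hidden state onto the safety subspace to its value in the aligned model, while components orthogonal to that subspace remain free to adapt to the downstream task.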

What carries the argument

RefusalGuard, a constraint on updates within hidden representation space that isolates and stabilizes safety-mediating components.
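
The excerpt does not say how those safety-mediating components are located. Related work often estimates a refusal direction as the difference in mean activations between harmful and harmless prompts; the sketch below uses that recipe purely as a stand-in for whatever identification procedure the paper actually employs, and every name in it is illustrative.

    # Illustrative sketch, not the paper's implementation: estimate candidate
    # safety directions at one layer from contrasting prompt sets, then
    # orthonormalize them into a small subspace basis.
    import numpy as np

    def refusal_direction(harmful_acts, harmless_acts):
        """harmful_acts, harmless_acts: arrays of shape (n_prompts, d_model)
        holding hidden activations at one layer. Returns a unit vector."""
        direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
        return direction / np.linalg.norm(direction)

    def safety_basis(directions):
        """directions: list of (d_model,) candidate safety vectors. Returns a
        (d_model, k) orthonormal basis, a crude analogue of a layer-specific
        safety subspace."""
        q, _ = np.linalg.qr(np.stack(directions, axis=1))
        return q

In this reading, such a basis would be computed once per intervention layer on the frozen aligned model, and the fine-tuning constraint would then hold activations stable along it.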

If this is right

  • Fine-tuned models retain refusal rates close to their original aligned versions on adversarial prompts.
  • The method works across multiple model families without requiring changes to the loss function or data.
  • Task-specific learning proceeds in directions orthogonal to the preserved safety geometry (a minimal projection sketch follows this list).
  • Interference between safety and utility objectives is reduced at the representation level.
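
The orthogonality in the third bullet can be made concrete with a projection split: any change in a hidden state decomposes into a component inside the safety subspace, which a geometry-preserving method would suppress, and a complementary component, which is left free to carry task learning. The helper below is a generic sketch of that split, not code from the paper.

    # Generic projection split, assuming an orthonormal safety basis is given.
    import numpy as np

    def split_drift(delta_h, basis):
        """delta_h: (d_model,) change in a hidden state after an update.
        basis: (d_model, k) orthonormal basis of the safety subspace.
        Returns (component inside the subspace, complementary component)."""
        inside = basis @ (basis.T @ delta_h)
        return inside, delta_h - inside

    # A geometry-preserving update keeps `inside` near zero; the complementary
    # component is where task-specific learning happens.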

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same geometric preservation idea could be applied to other alignment targets such as truthfulness or helpfulness if their representations can be identified.
  • If safety geometry is stable, periodic re-application of the constraint might allow longer sequences of task adaptations without repeated safety degradation.
  • This approach implies that safety can be treated as a separable subspace during adaptation rather than being re-learned from scratch each time.

Load-bearing premise

That the safety-relevant representations and their geometric structure remain stable enough across different fine-tuning scenarios to preserve refusal behavior without blocking necessary task learning.
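
That premise can be probed directly: estimate the safety subspace on the aligned model, estimate it again after fine-tuning, and measure how far the two subspaces have rotated. The check below uses the cosines of the principal angles between the two bases; it is a generic diagnostic, not a procedure taken from the paper.

    # Diagnostic sketch: subspace stability across fine-tuning.
    import numpy as np

    def subspace_overlap(basis_before, basis_after):
        """Both inputs: (d_model, k) matrices with orthonormal columns.
        Returns the k cosines of the principal angles between the spanned
        subspaces; values near 1.0 mean the safety geometry barely moved."""
        return np.linalg.svd(basis_before.T @ basis_after, compute_uv=False)

If these cosines stay near 1 under the constrained run but collapse under unconstrained fine-tuning, the premise holds in that regime; if they collapse in both, the constraint is anchored to the wrong geometry.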

What would settle it

The claim would be undermined if applying RefusalGuard's representation constraints during fine-tuning on a new task caused attack success rates to rise above those of the base model on AdvBench or JailbreakBench while task accuracy dropped by more than 5 percent relative to unconstrained fine-tuning.
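
Operationally, that comparison needs attack success rate and task accuracy for three checkpoints: the aligned base model, unconstrained fine-tuning, and fine-tuning with RefusalGuard. The sketch below shows the shape of the safety half of that comparison using a crude refusal-keyword heuristic; AdvBench and JailbreakBench evaluations typically rely on stronger judges, so this is illustrative only.

    # Illustrative scoring shape, not the benchmarks' official judging:
    # count non-refusals on adversarial prompts for a given checkpoint.
    REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

    def attack_success_rate(generate, adversarial_prompts):
        """generate: callable mapping a prompt string to a model response.
        Returns the fraction of prompts whose response contains no refusal
        marker, a crude stand-in for a proper harmfulness judge."""
        hits = 0
        for prompt in adversarial_prompts:
            response = generate(prompt).lower()
            if not any(marker in response for marker in REFUSAL_MARKERS):
                hits += 1
        return hits / max(len(adversarial_prompts), 1)

The falsification test above then compares this rate for the constrained and base checkpoints against the unconstrained run, alongside the corresponding task-accuracy numbers.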

Figures

Figures reproduced from arXiv: 2605.01913 by Mohammad Mohammadi Amiri, Sadia Asif.

Figure 1. Mechanistic degradation under harmful fine-tuning on …
Figure 2. Reduction in attack success rate (ASR) relative to LoRA fine-tuning for …
Figure 3. Ablation on the geometry-preservation coefficient …
Original abstract

Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within the model's activation space, how these representations change during fine-tuning and why alignment degrades remains poorly understood. In this work, we investigate the representation-level mechanisms underlying alignment degradation. Our analysis shows that standard fine-tuning induces systematic drift in safety-relevant representations, distorts their geometric structure, and introduces interference between task optimization and safety features. These effects collectively lead to increased harmful compliance. Motivated by these findings, we introduce REFUSALGUARD, a representation-level fine-tuning framework that preserves safety-relevant structure during model adaptation. Our approach constrains updates in hidden representation space, ensuring that safety-mediating components remain stable while allowing task-specific learning in complementary directions. We evaluate REFUSALGUARD across multiple model families, including LLaMA, Gemma, and Qwen, on adversarial safety benchmarks such as AdvBench, DirectHarm4, and JailbreakBench, as well as downstream utility tasks. Our approach achieves attack success rates comparable to base safety-aligned models while maintaining competitive task performance, significantly outperforming baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that fine-tuning safety-aligned LLMs for downstream tasks induces systematic drift in safety-relevant representations, distorts their geometric structure, and causes interference that increases harmful compliance. It introduces RefusalGuard, a representation-level fine-tuning framework that constrains updates in hidden representation space to preserve safety-mediating components while allowing task-specific learning in complementary directions. The method is evaluated on LLaMA, Gemma, and Qwen models using adversarial safety benchmarks (AdvBench, DirectHarm4, JailbreakBench) and downstream utility tasks, with the claim that it achieves attack success rates comparable to base safety-aligned models while maintaining competitive task performance and outperforming baselines.

Significance. If the empirical results hold under detailed scrutiny, this work would offer a practical approach to mitigating alignment degradation during fine-tuning, which is a pressing issue for safe LLM deployment. The emphasis on preserving geometric structure in safety-relevant representations provides a potentially useful lens beyond standard alignment techniques. Credit is given for the multi-family model evaluation and the attempt to connect representation drift to safety outcomes, though the lack of supporting quantitative evidence currently limits its assessed impact.

major comments (1)
  1. [Abstract] The central claim that REFUSALGUARD 'achieves attack success rates comparable to base safety-aligned models while maintaining competitive task performance' and 'significantly outperforming baselines' is presented without any quantitative metrics, benchmark scores, ablation studies, implementation details of the constraint, or error analysis. This omission is load-bearing for the empirical contribution, as it prevents verification of the result.
minor comments (1)
  1. [Abstract] The abstract refers to 'constrains updates in hidden representation space' and 'preserves safety-relevant structure' without referencing the specific section, equation, or algorithm that defines the constraint or the geometric preservation mechanism.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that REFUSALGUARD 'achieves attack success rates comparable to base safety-aligned models while maintaining competitive task performance' and 'significantly outperforming baselines' is presented without any quantitative metrics, benchmark scores, ablation studies, implementation details of the constraint, or error analysis. This omission is load-bearing for the empirical contribution, as it prevents verification of the result.

    Authors: We agree that the abstract would be improved by including concrete quantitative results to support its claims. In the revised manuscript we will update the abstract to report specific attack success rates on AdvBench, DirectHarm4, and JailbreakBench for RefusalGuard versus the base safety-aligned models and the baselines, together with the corresponding downstream task performance numbers. These metrics appear in Tables 1–3 and Figures 2–4 of the current manuscript; placing the key values in the abstract will make the central empirical contribution immediately verifiable. The implementation details of the geometric constraint, the ablation studies, and error analysis are already provided in Sections 3.2, 4.3, and the appendix; we will add a brief parenthetical reference to these sections in the revised abstract if space permits. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical investigation of representation drift during fine-tuning followed by a proposed constraint-based method (REFUSALGUARD) evaluated on external adversarial and utility benchmarks. No equations, parameter-fitting steps, or predictions are described that reduce by construction to the inputs or to self-citations. The central claims rest on observed experimental outcomes rather than any self-definitional or load-bearing self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review based solely on abstract; no explicit free parameters, axioms, or invented entities beyond the method name itself are detailed.

invented entities (1)
  • RefusalGuard framework · no independent evidence
    purpose: Constrain updates to preserve safety-relevant geometric structure during fine-tuning
    New method introduced to address observed representation drift and interference.

pith-pipeline@v0.9.0 · 5528 in / 1108 out tokens · 46769 ms · 2026-05-10T15:42:54.736362+00:00 · methodology

discussion (0)

