Safety-aligned weights are not enough: Refusal-teacher-guided finetuning enhances safety and downstream performance under harmful finetuning attacks, 2025

Ham, S · 2025 · arXiv 2506.07356

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

cs.AI · 2026-05-23 · unverdicted · novelty 6.0

A new fine-tuning defense uses temporary jailbreaking induced by BufferLoRA to limit harmful updates, followed by merging with ReinforceLoRA via QR decomposition to restore safety while keeping task performance.

citing papers explorer

Showing 1 of 1 citing paper.

Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

cs.AI · 2026-05-23 · unverdicted · none · ref 11

A new fine-tuning defense uses temporary jailbreaking induced by BufferLoRA to limit harmful updates, followed by merging with ReinforceLoRA via QR decomposition to restore safety while keeping task performance.
