A unified adaptive attack exploits the common weakness across 15 defenses against malicious fine-tuning, showing they only obscure rather than remove harmful model capabilities.
arXiv preprint arXiv:2510.10085
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Representative citing papers
A new fine-tuning defense uses temporary jailbreaking induced by BufferLoRA to limit harmful updates, followed by merging with ReinforceLoRA via QR decomposition to restore safety while keeping task performance.
DualSelect couples task and reference selection via a minimax framework with entropy-regularized scoring to preserve safety in LLM fine-tuning, reporting gains of at least 5.10 points in Safety Avg. over baselines on 1B-8B models.
SPARD defends LLMs from harmful fine-tuning attacks via alternating safety projections and relevance-diversity DPP data selection, reporting the lowest attack success rates on GSM8K and OpenBookQA while keeping task accuracy.
citing papers explorer
- Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models
A new fine-tuning defense uses temporary jailbreaking induced by BufferLoRA to limit harmful updates, followed by merging with ReinforceLoRA via QR decomposition to restore safety while keeping task performance.