A unified adaptive attack exploits the common weakness across 15 defenses against malicious fine-tuning, showing they only obscure rather than remove harmful model capabilities.
Antibody: Strengthening defense against harmful fine-tuning for large language models via attenuating harmful gradient influence. arXiv preprint arXiv:2603.00498, 2026.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries