BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
Backdoor defense for LLMs detects anomalous attention-head similarity on triggers and applies head-wise alignment via fine-tuning to reduce attack success.
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
citing papers explorer
-
BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models
BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
-
Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
Backdoor defense for LLMs detects anomalous attention-head similarity on triggers and applies head-wise alignment via fine-tuning to reduce attack success.
-
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.