Four changes to Activation Oracle training yield marginal capability gains but better practical quality, plus an open-sourced evaluation suite AObench.
Weight space Detection of Backdoors in LoRA Adapters
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories like Hugging Face Hub \citep{huggingface_hub_docs}, making them vulnerable to backdoor attacks. Current detection methods require running the model with test input data -- making them impractical for screening thousands of adapters where the trigger for backdoor behavior is unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model -- making our method trigger-agnostic. For each attention projection (Q, K, V, O), our method extracts five spectral statistics from the low-rank update $\Delta W$, yielding a 20-dimensional signature for each adapter. A logistic regression detector trained on this representation separates benign and poisoned adapters across three model families -- Llama-3.2-3B~\citep{llama3}, Qwen2.5-3B~\citep{qwen25}, and Gemma-2-2B~\citep{gemma2} -- on unseen test adapters drawn from instruction-following, reasoning, question-answering, code, and classification tasks. Across all three architectures, the detector achieves 100\% accuracy.
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Building Better Activation Oracles
Four changes to Activation Oracle training yield marginal capability gains but better practical quality, plus an open-sourced evaluation suite AObench.