Weight space Detection of Backdoors in LoRA Adapters

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it
abstract

LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories like Hugging Face Hub \citep{huggingface_hub_docs}, making them vulnerable to backdoor attacks. Current detection methods require running the model with test input data -- making them impractical for screening thousands of adapters where the trigger for backdoor behavior is unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model -- making our method trigger-agnostic. For each attention projection (Q, K, V, O), our method extracts five spectral statistics from the low-rank update $\Delta W$, yielding a 20-dimensional signature for each adapter. A logistic regression detector trained on this representation separates benign and poisoned adapters across three model families -- Llama-3.2-3B~\citep{llama3}, Qwen2.5-3B~\citep{qwen25}, and Gemma-2-2B~\citep{gemma2} -- on unseen test adapters drawn from instruction-following, reasoning, question-answering, code, and classification tasks. Across all three architectures, the detector achieves 100\% accuracy.
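The signature construction described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, assuming NumPy is available: the abstract does not name the five spectral statistics, so the ones used here (spectral norm, Frobenius norm, top singular value share, stable rank, spectral entropy) are assumptions, as are the names `spectral_stats`, `adapter_signature`, and the `q_proj`/`k_proj`/`v_proj`/`o_proj` dictionary layout. The LoRA scaling factor is omitted for brevity.

```python
import numpy as np

# Hypothetical keys for the four attention projections (Q, K, V, O).
PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")


def spectral_stats(A, B, eps=1e-12):
    """Five illustrative spectral statistics of the update delta_W = B @ A.

    A has shape (r, d_in) and B has shape (d_out, r), following the usual
    LoRA factorization. The choice of statistics is an assumption; the
    source does not list them.
    """
    delta_w = B @ A
    # Singular values in descending order; at most r of them are nonzero.
    s = np.linalg.svd(delta_w, compute_uv=False)[: A.shape[0]]
    energy = s ** 2
    p = energy / (energy.sum() + eps)  # normalized spectral energy
    return np.array([
        s[0],                              # spectral norm
        np.sqrt(energy.sum()),             # Frobenius norm
        s[0] / (s.sum() + eps),            # share of the top singular value
        energy.sum() / (energy[0] + eps),  # stable rank
        -np.sum(p * np.log(p + eps)),      # spectral entropy
    ])


def adapter_signature(adapter):
    """Concatenate the per-projection statistics into one 20-dim vector."""
    return np.concatenate([spectral_stats(*adapter[name]) for name in PROJECTIONS])


# Usage with random rank-4 factors standing in for a real adapter.
rng = np.random.default_rng(0)
example_adapter = {
    name: (rng.normal(size=(4, 16)), rng.normal(size=(16, 4)))
    for name in PROJECTIONS
}
signature = adapter_signature(example_adapter)
print(signature.shape)
```

Signatures of this kind, computed for a labeled set of benign and poisoned adapters, would then serve as the input features for a logistic regression classifier. Because only the weight matrices are read, no forward pass or trigger input is needed.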

fields

cs.LG 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

Building Better Activation Oracles

cs.LG · 2026-05-23 · unverdicted · novelty 3.0

Four changes to Activation Oracle training yield marginal capability gains but better practical quality, plus an open-sourced evaluation suite AObench.

citing papers explorer

  • Building Better Activation Oracles cs.LG · 2026-05-23 · unverdicted · none · ref 9 · internal anchor