Weight-Space Detection of Backdoors in LoRA Adapters
Pith reviewed 2026-05-15 21:30 UTC · model grok-4.3
The pith
Spectral statistics from LoRA weight updates detect backdoors with 100% accuracy across three model families without running the model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For each attention projection Q, K, V, O, five spectral statistics are extracted from the low-rank update ΔW to form a 20-dimensional signature per adapter. A logistic regression trained on this signature separates benign and poisoned adapters across Llama-3.2-3B, Qwen2.5-3B, and Gemma-2-2B, reaching 100% accuracy on unseen test adapters from multiple task domains.
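This pipeline can be sketched directly. The paper does not name its five statistics, so the singular-value summaries below (spectral norm, Frobenius norm, stable rank, top-value dominance, spectral entropy) and the function names are illustrative stand-ins, not the authors' definitions:

```python
import numpy as np

def delta_w_stats(A, B):
    """Five spectral statistics of the low-rank update ΔW = B @ A.

    Hypothetical choices for illustration: the paper does not specify
    which five statistics it extracts.
    """
    s = np.linalg.svd(B @ A, compute_uv=False)  # singular values, descending
    p = s / s.sum()
    return np.array([
        s[0],                            # spectral norm
        np.sqrt(np.sum(s ** 2)),         # Frobenius norm
        np.sum(s ** 2) / s[0] ** 2,      # stable rank
        s[0] / s.sum(),                  # dominance of the top singular value
        -np.sum(p * np.log(p + 1e-12)),  # entropy of the singular-value spectrum
    ])

def adapter_signature(lora):
    """20-dim signature: five statistics for each of Q, K, V, O.

    `lora` maps a projection name to its (A, B) low-rank factors.
    """
    return np.concatenate([delta_w_stats(*lora[k]) for k in ("q", "k", "v", "o")])
```

The 20 dimensions come from 5 statistics × 4 projections; a detector would then fit the logistic regression on these fixed-length vectors.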
What carries the argument
The 20-dimensional signature built from five spectral statistics on the low-rank update ΔW for each of the Q, K, V, O attention projections.
Load-bearing premise
The five spectral statistics per attention projection stay discriminative for backdoors even when poisoning method, trigger, or task differs from the training distribution.
What would settle it
A backdoored LoRA adapter from a new poisoning method or unseen model scale whose 20-dimensional spectral vector is classified as benign by the logistic regression.
Original abstract
LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories like Hugging Face Hub \citep{huggingface_hub_docs}, making them vulnerable to backdoor attacks. Current detection methods require running the model with test input data -- making them impractical for screening thousands of adapters where the trigger for backdoor behavior is unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model -- making our method trigger-agnostic. For each attention projection (Q, K, V, O), our method extracts five spectral statistics from the low-rank update $\Delta W$, yielding a 20-dimensional signature for each adapter. A logistic regression detector trained on this representation separates benign and poisoned adapters across three model families -- Llama-3.2-3B~\citep{llama3}, Qwen2.5-3B~\citep{qwen25}, and Gemma-2-2B~\citep{gemma2} -- on unseen test adapters drawn from instruction-following, reasoning, question-answering, code, and classification tasks. Across all three architectures, the detector achieves 100\% accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims a trigger-agnostic backdoor detector for LoRA adapters that extracts five spectral statistics from the low-rank update ΔW in each of the Q, K, V, and O attention projections (yielding a 20-dimensional feature vector per adapter) and trains a logistic regression classifier on labeled examples; the detector is reported to achieve 100% accuracy on unseen test adapters spanning multiple tasks and three model families (Llama-3.2-3B, Qwen2.5-3B, Gemma-2-2B) without requiring model inference or knowledge of the trigger.
Significance. If the central empirical claim holds under varied conditions, the work would be significant for enabling scalable, inference-free screening of adapters in public repositories such as Hugging Face Hub. The weight-space approach is practically attractive because it avoids the need to know or execute triggers and operates directly on the low-rank factors that users actually share.
major comments (2)
- [Abstract and experimental evaluation] The abstract and experimental description report 100% accuracy on unseen adapters but supply no information on backdoor generation method, training-set size, number of poisoned examples, or whether the test poisoned adapters employ the same insertion technique as the training poisoned adapters. If all poisoned examples share one insertion process, the 20-dimensional spectral features can separate classes by capturing method-specific distortions in the low-rank factors rather than a trigger-agnostic backdoor property.
- [Experimental evaluation] The central generalization claim rests on the assumption that the five spectral statistics per projection remain discriminative when poisoning method, trigger construction, or downstream task differs from the training distribution. The manuscript provides no cross-method experiments or ablation that varies the backdoor insertion process, leaving the load-bearing claim unsupported by the reported evidence.
minor comments (2)
- [Method] The exact definitions and formulas for the five spectral statistics computed on ΔW should be stated explicitly (e.g., which norms, eigenvalues, or singular-value summaries are used) so that the feature extraction is fully reproducible.
- [Results] Table or figure captions should report the number of benign and poisoned adapters in each train/test split and the precise backdoor injection parameters used.
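For concreteness, one fully specified candidate set, defined on the singular values $\sigma_1 \ge \dots \ge \sigma_r$ of $\Delta W$, would look like the following (these are plausible choices, not the paper's unstated definitions):

```latex
\begin{aligned}
s_1 &= \sigma_1 && \text{(spectral norm)} \\
s_2 &= \Bigl(\textstyle\sum_i \sigma_i^2\Bigr)^{1/2} && \text{(Frobenius norm)} \\
s_3 &= \textstyle\sum_i \sigma_i^2 \,/\, \sigma_1^2 && \text{(stable rank)} \\
s_4 &= \sigma_1 \,/\, \textstyle\sum_i \sigma_i && \text{(top-value dominance)} \\
s_5 &= -\textstyle\sum_i p_i \log p_i, \quad p_i = \sigma_i \,/\, \textstyle\sum_j \sigma_j && \text{(spectral entropy)}
\end{aligned}
```

Stating any such set explicitly, together with the SVD convention used, would make the feature extraction reproducible.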
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive comments. We will revise the manuscript to provide the requested details on the experimental setup and to strengthen the evidence for generalization. Our responses to the major comments are as follows.
Point-by-point responses
Referee: [Abstract and experimental evaluation] The abstract and experimental description report 100% accuracy on unseen adapters but supply no information on backdoor generation method, training-set size, number of poisoned examples, or whether the test poisoned adapters employ the same insertion technique as the training poisoned adapters. If all poisoned examples share one insertion process, the 20-dimensional spectral features can separate classes by capturing method-specific distortions in the low-rank factors rather than a trigger-agnostic backdoor property.
Authors: We agree that these details are necessary for a complete evaluation of the claims. The manuscript will be revised to include a full description of the backdoor generation method, which involves injecting a specific trigger during the fine-tuning of LoRA adapters on poisoned datasets. We will specify the training-set size (number of benign and poisoned adapters used for training the detector) and the number of poisoned examples. The test poisoned adapters were generated using the same insertion technique to ensure consistency in evaluating the spectral features' ability to detect backdoors, while varying the triggers and tasks to demonstrate trigger-agnostic detection. We acknowledge the referee's point that this may capture method-specific distortions; to address this, we will add a discussion of this limitation and include additional experiments with varied insertion processes in the revised version.
Revision: yes
Referee: [Experimental evaluation] The central generalization claim rests on the assumption that the five spectral statistics per projection remain discriminative when poisoning method, trigger construction, or downstream task differs from the training distribution. The manuscript provides no cross-method experiments or ablation that varies the backdoor insertion process, leaving the load-bearing claim unsupported by the reported evidence.
Authors: The current experiments demonstrate generalization across different downstream tasks and model families while using a consistent backdoor insertion method. This supports the trigger-agnostic aspect of the detector. However, we recognize that varying the poisoning method itself is important for the broader claim. In the revised manuscript, we will add an ablation study that varies the backdoor insertion process, such as different poisoning rates and trigger constructions, to provide stronger evidence. We will also clarify in the text that the primary contribution is trigger-agnostic detection for a given insertion method.
Revision: partial
Circularity Check
No significant circularity in feature extraction or classification pipeline
full rationale
The paper extracts five spectral statistics directly from the low-rank update ΔW for each attention projection, producing a 20-dimensional vector with no label-dependent fitting or parameter estimation inside the feature step. Logistic regression is then trained on these fixed features using labeled adapters in a standard supervised setup, and accuracy is reported on held-out test adapters. No equations reduce a claimed result to its own inputs by construction, no self-citations are load-bearing for the core method, and no ansatz or uniqueness theorem is smuggled in. The pipeline is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
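The supervised setup described here (fixed features, then a label-fitted classifier, then held-out evaluation) can be sketched end to end. The data below is synthetic: two shifted Gaussian clusters stand in for benign and poisoned signatures, and the plain-NumPy logistic regression is illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 20-dim "signatures": benign (label 0) vs. poisoned (label 1).
# The mean shift is an assumption made so the classes are separable.
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 20)),
               rng.normal(1.5, 1.0, size=(40, 20))])
y = np.array([0] * 40 + [1] * 40)

# Shuffle and split; features were fixed before labels entered, so the
# only fitted parameters are the logistic-regression coefficients.
idx = rng.permutation(len(y))
train, test = idx[:60], idx[60:]

w, b = np.zeros(20), 0.0
for _ in range(2000):  # batch gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-(X[train] @ w + b)))
    g = p - y[train]
    w -= 0.1 * (X[train].T @ g) / len(train)
    b -= 0.1 * g.mean()

pred = (1.0 / (1.0 + np.exp(-(X[test] @ w + b)))) > 0.5
acc = (pred == y[test]).mean()  # held-out accuracy
```

Because the feature step involves no label-dependent fitting, reported test accuracy measures the classifier alone, which is what the circularity check verifies.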
Axiom & Free-Parameter Ledger
free parameters (1)
- logistic regression coefficients
axioms (1)
- Domain assumption: spectral statistics of ΔW capture backdoor signatures independently of trigger and task