Fine-pruning: Defending against backdooring attacks on deep neural networks
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Activation Differences Reveal Backdoors: A Comparison of SAE Architectures
Differential SAEs isolate backdoor features far better than Crosscoders, reaching a Backdoor Isolation Score of 0.40 with perfect precision while Crosscoders stay below 0.02.