Sparse autoencoders inserted into VLMs and trained only for reconstruction can reliably detect adversarial attacks on images, including unseen domains and attack types.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
verdicts
UNVERDICTED 2representative citing papers
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
citing papers explorer
-
Steering Llama 2 via Contrastive Activation Addition
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.