Sparse autoencoders inserted into VLMs and trained only for reconstruction can reliably detect adversarial attacks on images, including unseen domains and attack types.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
method 1
citation-polarity summary
fields
cs.CV 3years
2026 3roles
method 1polarities
use method 1representative citing papers
PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.
citing papers explorer
-
Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
Sparse autoencoders inserted into VLMs and trained only for reconstruction can reliably detect adversarial attacks on images, including unseen domains and attack types.