Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.
Leveraging foundation models via knowledge distillation in multi-object tracking: Distilling dinov2 features to fairmot
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models
Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.