Axiomatic attribution for deep networks
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
A ViT-LSTM spatiotemporal model detects surgical instrument handovers and classifies direction in videos, achieving F1 of 0.84 for detection and 0.72 mean F1 for direction on kidney transplant data.
citing papers explorer
- Mechanisms of Object Localization in Vision-Language Models
Localization in VLMs relies on a containerization mechanism driven by object-aligned tokens and a narrow set of specialized attention heads in early-to-mid or mid-to-late layers.
- Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
A ViT-LSTM spatiotemporal model detects surgical instrument handovers and classifies direction in videos, achieving F1 of 0.84 for detection and 0.72 mean F1 for direction on kidney transplant data.