VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
Simple open-vocabulary object detection
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
baseline 1
citation-polarity summary
fields
cs.CV 2years
2026 2verdicts
UNVERDICTED 2roles
baseline 1polarities
baseline 1representative citing papers
DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.
citing papers explorer
-
VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
-
DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer
DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.