T-REN learns compact text-aligned region tokens from frozen vision features to strengthen dense cross-modal alignment and enable scalable processing of images and videos.
IJCV (2020) 16 S
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability