CG-CLIP adds caption-guided memory refinement and token-based spatiotemporal aggregation to CLIP for video person ReID, outperforming SOTA on MARS, iLIDS-VID, SportsVReID and DanceVReID.
A video is worth three views: Trigeminal transformers for video-based person re- identification.IEEE Transactions on Intelligent Transporta- tion Systems, 25(9):12818–12828
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification
CG-CLIP adds caption-guided memory refinement and token-based spatiotemporal aggregation to CLIP for video person ReID, outperforming SOTA on MARS, iLIDS-VID, SportsVReID and DanceVReID.