A multi-scale and cross-scale contrastive learning framework uses intra-encoder stage features and a new sampling process to link short-range and long-range video moments for temporal grounding.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2verdicts
UNVERDICTED 2representative citing papers
DemaFormer pairs energy-based modeling with a damped-EMA Transformer to localize video moments matching language queries and reports gains over baselines on four datasets.
citing papers explorer
-
Multi-Scale Contrastive Learning for Video Temporal Grounding
A multi-scale and cross-scale contrastive learning framework uses intra-encoder stage features and a new sampling process to link short-range and long-range video moments for temporal grounding.
-
DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding
DemaFormer pairs energy-based modeling with a damped-EMA Transformer to localize video moments matching language queries and reports gains over baselines on four datasets.