InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
Mer- lot: Multimodal neural script knowledge models.Advances in neural information processing systems, 34:23634–23651,
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.
citing papers explorer
-
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
-
How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.