Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651,

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.

citing papers explorer

Showing 2 of 2 citing papers.

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

cs.CV · 2026-04-09 · unverdicted · none · ref 66

How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

cs.CV · 2026-04-10 · unverdicted · none · ref 36
