End-to-end dense video captioning as sequence generation

Wanrong Zhu, Bo Pang, Ashish V Thapliyal, William Yang Wang, Radu Soricut · 2022 · arXiv 2204.08121

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

representative citing papers

DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.

TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

cs.CV · 2024-12-04 · unverdicted · novelty 5.0

TemporalVLM adds timestamp-aware clip encoding and BiLSTM global aggregation to video LLMs, introduces the IndustryASM factory dataset, and reports outperformance on dense captioning, temporal grounding, highlight detection, and action segmentation.

citing papers explorer

Showing 2 of 2 citing papers.

DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation cs.CV · 2026-04-29 · unverdicted · none · ref 92
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos cs.CV · 2024-12-04 · unverdicted · none · ref 60
TemporalVLM adds timestamp-aware clip encoding and BiLSTM global aggregation to video LLMs, introduces the IndustryASM factory dataset, and reports outperformance on dense captioning, temporal grounding, highlight detection, and action segmentation.

End-to-end dense video captioning as sequence generation

fields

years

verdicts

representative citing papers

citing papers explorer