pith. machine review for the scientific record.

arxiv: 1906.02467 · v1 · submitted 2019-06-06 · 💻 cs.CV

Recognition: unknown

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

Authors on Pith: no claims yet
classification 💻 cs.CV
keywords dataset · videoqa · activitynet-qa · answering · question · scale · video · videos
original abstract

Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). In contrast to the image domain, where large-scale, fully annotated benchmark datasets exist, existing VideoQA datasets are small in scale or automatically generated, which restricts their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated, large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of ActivityNet-QA and conduct extensive experiments on it, comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos. The dataset is available at https://github.com/MILVLG/activitynet-qa
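A minimal sketch of loading the dataset's QA pairs, assuming the linked repo ships question and answer records as parallel JSON lists (e.g. `dataset/train_q.json` / `dataset/train_a.json`) joined on a shared `question_id`; the file names and field names here are assumptions, not confirmed by the abstract, so check the repo's README for the actual layout.

```python
import json

# Assumed paths and schema: questions and answers stored as parallel
# JSON lists, each record carrying a shared question_id (hypothetical).
QUESTIONS_PATH = "dataset/train_q.json"
ANSWERS_PATH = "dataset/train_a.json"

def load_qa_pairs(q_path: str, a_path: str) -> list[dict]:
    """Join question and answer records on their shared question_id."""
    with open(q_path) as f:
        questions = {q["question_id"]: q for q in json.load(f)}
    with open(a_path) as f:
        answers = json.load(f)
    pairs = []
    for ans in answers:
        q = questions.get(ans["question_id"])
        if q is not None:
            pairs.append({
                "video": q["video_name"],   # ActivityNet video id (assumed field)
                "question": q["question"],
                "answer": ans["answer"],
                "type": ans.get("type"),    # question-type tag, if present
            })
    return pairs

if __name__ == "__main__":
    pairs = load_qa_pairs(QUESTIONS_PATH, ANSWERS_PATH)
    print(f"loaded {len(pairs)} QA pairs")
```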

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

cs.AI · 2026-04 · unverdicted · novelty 7.0

    IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.