
arxiv: 2408.14469 · v1 · pith:DDIELDJInew · submitted 2024-08-26 · 💻 cs.CV

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

classification 💻 cs.CV
keywords grounding · multi-hop · videos · evidence · task · visual · architecture · benchmark

This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. The task requires not only answering visual questions but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction tuning. To track progress on this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, with careful manual verification and refinement. Experimental results reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models (MLLMs) by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a new baseline for this challenging task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision

    cs.CV 2026-05 unverdicted novelty 7.0

    EGOSTREAM introduces a diagnostic benchmark that expands 2,250 questions into 8,528 recall-conditioned evaluations to measure streaming episodic memory performance across detail, spatial, temporal, event, social, caus...