Retrieval-Augmented Egocentric Video Captioning

Guo Chen; Jilan Xu; Junlin Hou; Rui Feng; Weidi Xie; Yifei Huang; Yuejie Zhang

arxiv: 2401.00789 · v4 · pith:PMMYPSKWnew · submitted 2024-01-01 · 💻 cs.CV

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

classification 💻 cs.CV

keywords egocentric · videos · video · captioning · cross-view · egoinstructor · module · retrieval

Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. Project page is available at: https://jazzcharles.github.io/Egoinstructor/
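The abstract describes the EgoExoNCE loss only at a high level: egocentric and exocentric video features are pulled together by aligning both to text features that describe similar actions. The sketch below illustrates that general idea as a multi-positive InfoNCE loss in plain Python. It is not the paper's exact formulation. The function name `multi_positive_nce`, the temperature value, the toy feature vectors, and the assumption that a precomputed `positive_mask` marks which texts describe a similar action are all hypothetical choices made for illustration.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def multi_positive_nce(video_feats, text_feats, positive_mask, temperature=0.05):
    """Video-to-text contrastive loss with several positives per video.

    positive_mask[i][j] is True when text j describes an action similar to
    video i, regardless of whether video i is egocentric or exocentric.
    How that mask is built is an assumption left outside this sketch.
    """
    videos = [_normalize(v) for v in video_feats]
    texts = [_normalize(t) for t in text_feats]
    total = 0.0
    for i, video in enumerate(videos):
        if not any(positive_mask[i]):
            raise ValueError(f"video {i} has no positive text")
        logits = [_dot(video, t) / temperature for t in texts]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        pos = sum(e for e, is_pos in zip(exps, positive_mask[i]) if is_pos)
        total += -math.log(pos / sum(exps))
    return total / len(videos)


# Toy batch (hypothetical): an ego clip and an exo clip of the same action,
# plus an ego clip of a different action, each with one narration.
video_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
positive_mask = [
    [True, True, False],   # ego clip: its own text and the exo clip's text
    [True, True, False],   # exo clip: shares the same two positives
    [False, False, True],  # unrelated ego clip
]

aligned_loss = multi_positive_nce(video_feats, text_feats, positive_mask)
# Swapping the first and last video features breaks the alignment.
shuffled_loss = multi_positive_nce(video_feats[::-1], text_feats, positive_mask)
```

Because the ego and exo clips share the same set of positive texts, minimizing this loss pushes both toward the same region of the text embedding space, which is the effect the abstract attributes to the loss. A symmetric text-to-video term would normally be added in practice; it is omitted here for brevity.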

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Personal Visual Context Learning in Large Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.

Retrieval-Augmented Generation for AI-Generated Content: A Survey
cs.CV 2024-02 accept novelty 5.0

A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.