Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

classification cs.CV

keywords charades-ego, dataset, video, activity, egocentric, first, third-person, annotations

In Actor and Observer, we introduced the Charades-Ego dataset, which links the first- and third-person video understanding domains. In this paper, we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first- and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, which consists of an additional 82.3 hours of third-person video with 66,500 activity instances. Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.
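To put the abstract's figures side by side, the short sketch below computes annotation density (activity instances per hour of video) for each dataset and for the two combined. Only the four numbers come from the abstract; the dictionary layout and the helper names `instances_per_hour` and `combined` are illustrative assumptions, not part of any released tooling.

```python
# Figures quoted in the abstract; everything else here is a hypothetical helper.
CHARADES_EGO = {"instances": 68_536, "hours": 68.8}  # first- and third-person video
CHARADES = {"instances": 66_500, "hours": 82.3}      # additional third-person video


def instances_per_hour(stats):
    """Average number of annotated activity instances per hour of video."""
    return stats["instances"] / stats["hours"]


def combined(*datasets):
    """Sum instance counts and hours across datasets."""
    return {
        "instances": sum(d["instances"] for d in datasets),
        "hours": sum(d["hours"] for d in datasets),
    }


total = combined(CHARADES_EGO, CHARADES)
ego_density = instances_per_hour(CHARADES_EGO)
charades_density = instances_per_hour(CHARADES)
total_density = instances_per_hour(total)

print(f"Charades-Ego: {ego_density:.0f} instances/hour")
print(f"Charades:     {charades_density:.0f} instances/hour")
print(f"Combined:     {total['instances']} instances in {total['hours']:.1f} hours")
```

Under these figures, Charades-Ego works out to roughly 996 instances per hour and Charades to roughly 808. Note that the Charades-Ego hours cover both viewpoints, so its density is per hour of first- and third-person footage taken together.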

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
cs.CV 2022-04 unverdicted novelty 7.0

Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
cs.CV 2026-04 unverdicted novelty 6.0

A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
cs.CV 2023-10 unverdicted novelty 6.0

LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.

Bringing a Personal Point of View: Evaluating Dynamic 3D Gaussian Splatting for Egocentric Scene Reconstruction
cs.CV 2026-04 conditional novelty 5.0

Dynamic 3DGS models achieve lower PSNR on egocentric videos than exocentric ones, with the gap arising from static content reconstruction.