
arxiv: 1804.09626 · v2 · submitted 2018-04-25 · 💻 cs.CV

Recognition: unknown

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

classification 💻 cs.CV
keywords: charades-ego, dataset, video, activity, egocentric, first, third-person, annotations

In Actor and Observer, we introduced the Charades-Ego Dataset, a dataset linking the first- and third-person video understanding domains. In this paper, we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first- and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, which consists of an additional 82.3 hours of third-person video with 66,500 activity instances. Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.

This paper has not been read by Pith yet.


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    cs.CV 2022-04 unverdicted novelty 7.0

    Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

  2. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  3. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

    cs.CV 2023-10 unverdicted novelty 6.0

    LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.

  4. Bringing a Personal Point of View: Evaluating Dynamic 3D Gaussian Splatting for Egocentric Scene Reconstruction

    cs.CV 2026-04 conditional novelty 5.0

    Dynamic 3DGS models achieve lower PSNR on egocentric videos than exocentric ones, with the gap arising from static content reconstruction.