
arxiv: 1804.09626 · v2 · pith:C6WJO2VVnew · submitted 2018-04-25 · 💻 cs.CV

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

classification 💻 cs.CV
keywords charades-ego · dataset · video · activity · egocentric · first · third-person · annotations

In Actor and Observer we introduced a dataset linking the first- and third-person video understanding domains, the Charades-Ego Dataset. In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first- and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, which consists of an additional 82.3 hours of third-person video with 66,500 activity instances. Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

    cs.CV 2026-05 unverdicted novelty 8.0

    EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

  2. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    cs.CV 2022-04 unverdicted novelty 7.0

    Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

  3. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  4. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

    cs.CV 2023-10 unverdicted novelty 6.0

    LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.

  5. Bringing a Personal Point of View: Evaluating Dynamic 3D Gaussian Splatting for Egocentric Scene Reconstruction

    cs.CV 2026-04 conditional novelty 5.0

    Dynamic 3DGS models achieve lower PSNR on egocentric videos than exocentric ones, with the gap arising from static content reconstruction.

  6. LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

    cs.CV 2025-01 unverdicted novelty 5.0

    LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.