arxiv: 2110.07058 · v3 · pith:WZOPS6BKnew · submitted 2021-10-13 · 💻 cs.CV · cs.AI

Ego4D: Around the World in 3,000 Hours of Egocentric Video

keywords video · egocentric · benchmark · ego4d · around · dataset · first-person · hours

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  2. Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

    cs.CV 2026-05 unverdicted novelty 7.0

    Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.

  3. HRDexDB: A Paired Human-Robot Dataset for Cross-Embodiment Dexterous Grasping

    cs.RO 2026-04 unverdicted novelty 7.0

    HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.

  4. How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption

    cs.CY 2025-10 accept novelty 7.0

    Public defenders view AI as most useful for evidence investigation but limited in courtroom work and strategy, with adoption blocked by costs, confidentiality risks, and norms, requiring human oversight and open development.

  5. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    cs.CV 2022-04 unverdicted novelty 7.0

    Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

  6. PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

    cs.RO 2026-06 unverdicted novelty 6.0

    PoLAR imposes radial structure on latent actions in hyperbolic space to factorize extent and mode, improving robot policy performance over baselines.

  7. Contrastive Action-Image Pre-training for Visuomotor Control

    cs.RO 2026-06 unverdicted novelty 6.0

    CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.

  8. Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

    cs.CV 2026-05 unverdicted novelty 6.0

    Introduces pause-and-think-T dataset and pause-and-think-B benchmark for video-grounded assistive action suggestion, enabling a 4B VLM to match larger models on reasoning tasks and generalize to EgoThink and TempCompass.

  9. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV 2026-05 unverdicted novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

  10. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  11. RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

    cs.RO 2026-04 unverdicted novelty 6.0

    RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.

  12. HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

    cs.CV 2026-01 unverdicted novelty 6.0

    HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.

  13. DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

    cs.CV 2023-08 unverdicted novelty 6.0

    DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.

  14. The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 5.0

    Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.

  15. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  16. World Action Models: A Survey

    cs.RO 2026-06 unverdicted novelty 3.0

    A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.