Ego4D: Around the World in 3,000 Hours of Egocentric Video

Abrham Gebreselasie; Adriano Fragomeni; Akshay Erapalli; Andrew Westbury; Antonino Furnari; Antonio Torralba; Anurag Kumar; Aude Oliva; Audrey Southerland; Bernard Ghanem

arxiv: 2110.07058 · v3 · pith:WZOPS6BKnew · submitted 2021-10-13 · 💻 cs.CV · cs.AI


Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik



keywords: video, egocentric, benchmark, ego4d, around, dataset, first-person, hours


Abstract

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/



Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL · 2023-09 · unverdicted · novelty 8.0
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion
cs.CV · 2026-05 · unverdicted · novelty 7.0
Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.

HRDexDB: A Paired Human-Robot Dataset for Cross-Embodiment Dexterous Grasping
cs.RO · 2026-04 · unverdicted · novelty 7.0
HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.

How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption
cs.CY · 2025-10 · accept · novelty 7.0
Public defenders view AI as most useful for evidence investigation but limited in courtroom work and strategy, with adoption blocked by costs, confidentiality risks, and norms, requiring human oversight and open development.

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
cs.CV · 2022-04 · unverdicted · novelty 7.0
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning
cs.RO · 2026-06 · unverdicted · novelty 6.0
PoLAR imposes radial structure on latent actions in hyperbolic space to factorize extent and mode, improving robot policy performance over baselines.

Contrastive Action-Image Pre-training for Visuomotor Control
cs.RO · 2026-06 · unverdicted · novelty 6.0
CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion
cs.CV · 2026-05 · unverdicted · novelty 6.0
Introduces pause-and-think-T dataset and pause-and-think-B benchmark for video-grounded assistive action suggestion, enabling a 4B VLM to match larger models on reasoning tasks and generalize to EgoThink and TempCompass.

HumanNet: Scaling Human-centric Video Learning to One Million Hours
cs.CV · 2026-05 · unverdicted · novelty 6.0
HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
cs.RO · 2026-04 · unverdicted · novelty 6.0
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
cs.RO · 2026-04 · unverdicted · novelty 6.0
RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
cs.CV · 2026-01 · unverdicted · novelty 6.0
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
cs.CV · 2023-08 · unverdicted · novelty 6.0
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.

The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
cs.SE · 2026-05 · unverdicted · novelty 5.0
Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.

World Action Models: The Next Frontier in Embodied AI
cs.RO · 2026-05 · unverdicted · novelty 4.0
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

World Action Models: A Survey
cs.RO · 2026-06 · unverdicted · novelty 3.0
A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.