EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos

Baoqi Pei; Guo Chen; Jilan Xu; Junlin Hou; Qingqiu Li; Rui Feng; Weidi Xie; Yifei Huang; Yuejie Zhang

arxiv: 2504.11732 · v1 · pith:ZNYMSRP5new · submitted 2025-04-16 · 💻 cs.CV

Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

keywords video · prediction · ego-centric · egoexo-gen · masks · videos · cross-view · first

Generating videos in the first-person perspective has broad application prospects in augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task: given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen, which explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline that generates pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that EgoExo-Gen achieves better prediction performance than previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
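
The two-stage data flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the names `CrossViewInputs`, `predict_future_ego_frames`, `toy_mask_predictor`, and `toy_video_generator` are hypothetical, frames are toy nested lists, and the two stand-in functions return trivial outputs in place of the learned mask predictor and the video diffusion model. The choice of one future frame per exo frame after the first is also an assumption made for the demo.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[List[float]]  # toy H x W grayscale frame (assumption for this sketch)
Mask = List[List[int]]     # toy H x W binary hand-object interaction (HOI) mask


@dataclass
class CrossViewInputs:
    exo_video: List[Frame]   # observed exo-centric frames
    first_ego_frame: Frame   # first frame of the corresponding ego-centric video
    instruction: str         # textual instruction


def predict_future_ego_frames(
    inputs: CrossViewInputs,
    mask_predictor: Callable[[List[Frame], Frame], List[Mask]],
    video_generator: Callable[[Frame, str, List[Mask]], List[Frame]],
) -> List[Frame]:
    # Stage 1: anticipate HOI masks for future ego frames from the exo video
    # and the first ego frame.
    hoi_masks = mask_predictor(inputs.exo_video, inputs.first_ego_frame)
    # Stage 2: generate future ego frames from the first ego frame and the
    # instruction, with the predicted masks passed in as structural guidance.
    return video_generator(inputs.first_ego_frame, inputs.instruction, hoi_masks)


def toy_mask_predictor(exo_video: List[Frame], first_ego_frame: Frame) -> List[Mask]:
    # Hypothetical stand-in: one all-zero mask per exo frame after the first.
    h, w = len(first_ego_frame), len(first_ego_frame[0])
    return [[[0] * w for _ in range(h)] for _ in exo_video[1:]]


def toy_video_generator(
    first_ego_frame: Frame, instruction: str, hoi_masks: List[Mask]
) -> List[Frame]:
    # Hypothetical stand-in: copy the first ego frame once per predicted mask.
    return [[row[:] for row in first_ego_frame] for _ in hoi_masks]


# Toy inputs: four 2 x 3 exo frames and one 2 x 3 ego frame.
exo = [[[0.0] * 3 for _ in range(2)] for _ in range(4)]
ego0 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
inputs = CrossViewInputs(exo_video=exo, first_ego_frame=ego0, instruction="pick up the cup")

future = predict_future_ego_frames(inputs, toy_mask_predictor, toy_video_generator)
print(len(future), "future ego frames generated")
```

In this outline the exo-centric video enters only through the first stage, matching the abstract's description that the second stage conditions on the first ego-frame, the text, and the predicted masks.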

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting
cs.CV 2025-11 unverdicted novelty 7.0

SFHand presents the first streaming language-guided autoregressive framework for 3D hand forecasting, achieving up to 35.8% gains over prior methods and 13.4% better downstream embodied task performance.
E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
cs.CV 2026-05 unverdicted novelty 6.0

E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
EgoExo-WM: Unlocking Exo Video for Ego World Models
cs.CV 2026-05 unverdicted novelty 6.0

Converting exocentric video to egocentric format via body-pose extraction and kinematics prior enables training of action-conditioned egocentric world models that improve prediction quality and goal-directed planning.
EgoExo-WM: Unlocking Exo Video for Ego World Models
cs.CV 2026-05 unverdicted novelty 6.0

Method converts exocentric videos to egocentric format via body-pose extraction and kinematics to improve egocentric world-model prediction and planning.
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
cs.CV 2026-04 unverdicted novelty 6.0

Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.