
arxiv: 2604.01001 · v2 · pith:PLLONZYPnew · submitted 2026-04-01 · 💻 cs.CV · cs.AI

EgoSim: Egocentric World Simulator for Embodied Interaction Generation

classification 💻 cs.CV cs.AI
keywords: egocentric, egosim, world, interactions, consistency, data, embodiment, existing

We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. A Geometry-action-aware Observation Simulation model generates embodiment interactions, while an Interaction-aware State Updating module maintains spatial consistency. Densely aligned scene-interaction training pairs are difficult to acquire, which creates a critical data bottleneck; to overcome it, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from large-scale, in-the-wild monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.

This paper has not been read by Pith yet.


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

    cs.CV 2026-05 unverdicted novelty 7.0

    Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...

  2. EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

    cs.CV 2026-05 unverdicted novelty 7.0

    EgoInteract is a new simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions, producing a dataset that improves model performance on real-world benchmarks ...

  3. HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control

    cs.CV 2026-07 unverdicted novelty 6.0

    HandsOnWorld creates a hand-controlled egocentric video generator from unconstrained monocular video via a new EgoVid-Pro dataset from monocular reconstruction and a Plücker Hand Map that disentangles camera and hand motion.

  4. AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

    cs.CV 2026-06 unverdicted novelty 5.0

    AnchorWorld proposes a simulation framework that adds exogenous viewpoint supervision for full-body grounding and anchor-view text customization for dynamic world evolution in egocentric settings.

  5. EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

    cs.CV 2026-05 unverdicted novelty 5.0

    EgoInteract is a simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions that produces training data improving model performance on real-world benchmarks f...