EgoExo-WM: Unlocking Exo Video for Ego World Models

Danny Tran; Kristen Grauman; Roberto Martín-Martín

arxiv: 2605.15477 · v2 · pith:WV4JHJDGnew · submitted 2026-05-14 · 💻 cs.CV

Pith reviewed 2026-05-19 14:28 UTC · model grok-4.3

keywords: egocentric world models, exocentric video, video transformation, body pose extraction, action-conditioned prediction, robot planning, visual goal reaching

The pith

Converting exocentric videos into egocentric views via body pose extraction allows training of more capable action-conditioned world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address the shortage of egocentric video data for world models by developing a conversion process that turns plentiful exocentric footage into aligned egocentric training examples. It extracts body poses from the exo videos to represent actions and applies a transformation guided by human kinematics to generate ego perspectives. A reader would care because this conversion could expand usable training data dramatically, leading to world models that better predict future states and plan sequences of body poses to reach visual goals. If the approach holds, agents could draw on in-the-wild videos to improve performance in prediction and planning tasks without needing massive new egocentric recordings.

Core claim

Extracting structured body pose from exocentric video as a representation of action, and transforming that video into egocentric video under a human kinematics prior, unlocks in-the-wild exocentric data for egocentric world model training. Training whole-body action-conditioned egocentric world models on the converted data significantly improves both prediction quality and downstream planning performance, where the sequence of body poses needed to achieve a visual goal state is inferred.

What carries the argument

The exocentric-to-egocentric video transformation that extracts body poses and applies a human kinematics prior to produce action-aligned egocentric training data.
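
To make "action-aligned" concrete, here is a minimal Python sketch of how converted clips could be arranged into world-model training tuples. It assumes each converted clip yields one synthesized egocentric frame and one body-pose vector per timestep, and that the pose at step t is treated as the action leading to frame t+1. The `Transition` type, `build_transitions`, and the toy values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Transition:
    """One training example: observation, action, next observation."""
    obs: Sequence[float]       # synthesized egocentric frame (flattened toy values)
    action: Sequence[float]    # body pose recovered from the exocentric video
    next_obs: Sequence[float]  # synthesized egocentric frame one step later


def build_transitions(ego_frames, poses) -> List[Transition]:
    """Pair each ego frame with the pose at that step and the following frame."""
    if len(ego_frames) != len(poses):
        raise ValueError("frames and poses must be aligned one-to-one")
    return [
        Transition(ego_frames[t], poses[t], ego_frames[t + 1])
        for t in range(len(ego_frames) - 1)
    ]


# Toy clip (assumed shapes): 4 timesteps, 2-value "frames", 3-value "poses".
frames = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
poses = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
transitions = build_transitions(frames, poses)
print(len(transitions))  # 3
```

The one-to-one length check is the point of the sketch: if pose extraction and view synthesis drift out of step, the action labels no longer describe the observations they are paired with.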

If this is right

Whole-body action-conditioned egocentric world models achieve higher prediction quality when trained on the converted data.
Downstream planning improves by more accurately inferring sequences of body poses that reach a desired visual goal state.
Arbitrary in-the-wild videos become usable as sources for building egocentric world models.
Applications in robot planning and augmented-reality guidance gain from the expanded training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conversion pipeline could be tested on other partial-observability settings where exo-style footage is easier to obtain than ego-style footage.
If the kinematics prior generalizes across body types and camera angles, the method might scale to crowd-sourced video collections without per-video manual alignment.
Planning performance gains might translate to real-robot control loops if the inferred pose sequences are executed on hardware with similar kinematics.

Load-bearing premise

The exocentric-to-egocentric video transformation informed by a human kinematics prior produces training data whose action representation and visual statistics remain sufficiently faithful to real egocentric observations that downstream gains are not artifacts of the conversion.

What would settle it

Retraining the world models on the converted data and measuring no gain in future-frame prediction accuracy or in success rate at inferring body-pose sequences for visual goals, relative to models trained only on native egocentric data, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.15477 by Danny Tran, Kristen Grauman, Roberto Martín-Martín.

**Figure 1.** Overview. Egocentric video provides an embodied view but often hides the body and occludes the hands, while exocentric video often reveals full-body motion (a). EgoExo-WM uses recovered 3D human motion as a bridge for learning an egocentric world-model with exocentric video: it defines the action sequence and guides exo-to-ego synthesis into action-aligned egocentric observations (b). The learned world mod…

**Figure 2.** World Model Training. EgoExo-WM unlocks exocentric video for egocentric world model training. Given an exocentric video, we recover 3D human motion which we use alongside the original video to ground our exo-to-ego conversion. The 3D human motion becomes our actions and the converted exocentric video becomes the egocentric observation. We then train EgoExo-WM autoregressively with teacher forcing. We apply…

**Figure 3.** EgoX-Body Qualitative Comparison. EgoX-Body better grounds generated egocentric video in human motion and interaction structure. On the egocentric side, we introduce an egocentric hand kinematics conditioning to directly reflect the interactions that define egocentric video. We condition the model with a drawn hand-skeleton overlay that exposes hand kinematics, helping generate consistent hand motion an…

**Figure 4.** EgoX-Body Inference Overview. From exocentric videos, we extract body pose and lift the scene into a 3D point cloud. The body skeleton is overlaid onto the exocentric video, while the same pose and geometry are used to render an egocentric prior with predicted hand locations. We form two latent inputs: (1) the clean exocentric latent concatenated with noise, and (2) the body-overlaid exocentric latent conc…

**Figure 5.** Qualitative planning results. From an observation and a visual goal, a trajectory sampler proposes candidate motion sequences, and the world model ranks them to select the one whose predicted outcome best matches the goal. In the first example, the goal is to move left toward the sink, whereas in the second, the goal is to pour cereal. EgoExo-WM chooses trajectories that better match the ground-truth behav…

**Figure 6.** Examples of failure cases in Internet ego-view videos. We show representative clips with large white or black regions, which commonly arise from videos where the person is directly facing the camera where the egocentric prior is essentially black. These failure cases provide little useful training signal, motivating the automatic filtering criteria described in Section A.3.3. We retain clips satisfying bla…
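
The Figure 6 caption is cut off before the paper's actual filtering criteria, so the following Python sketch only illustrates the general idea of rejecting clips dominated by black or white regions. The function names, the grayscale cutoffs (`low`, `high`), and the 0.5 fraction limits are placeholder assumptions, not the values from Section A.3.3.

```python
def extreme_pixel_fraction(clip, low=10, high=245):
    """Return (black_fraction, white_fraction) over all pixels in a clip.

    `clip` is a list of frames; each frame is a list of rows of 0-255
    grayscale values. `low` and `high` are assumed cutoffs.
    """
    black = white = total = 0
    for frame in clip:
        for row in frame:
            for value in row:
                total += 1
                if value <= low:
                    black += 1
                elif value >= high:
                    white += 1
    if total == 0:
        raise ValueError("clip has no pixels")
    return black / total, white / total


def keep_clip(clip, max_black=0.5, max_white=0.5):
    """Keep a clip only if neither black nor white pixels dominate (assumed limits)."""
    black, white = extreme_pixel_fraction(clip)
    return black <= max_black and white <= max_white


# Toy single-frame clips with 2x2 pixels each.
mostly_black = [[[0, 0], [0, 128]]]
normal = [[[100, 120], [130, 0]]]
print(keep_clip(mostly_black), keep_clip(normal))  # False True
```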

Original abstract

Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.
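
The planning setup described in the abstract and in the Figure 5 caption (a sampler proposes candidate motion sequences, and the world model ranks them by how well the predicted outcome matches the goal) can be sketched in Python as below. This is a toy illustration under strong assumptions: the state is a 2D point rather than an egocentric frame, each action is a 2D displacement rather than a body pose, and `toy_world_model`, `sample_trajectories`, `goal_distance`, and `plan` are hypothetical names. The paper's sampler, model, and goal-matching score are not specified here.

```python
import random


def toy_world_model(state, actions):
    """Stand-in predictor: each action is a 2D displacement added to the state."""
    x, y = state
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return (x, y)


def sample_trajectories(rng, num_candidates, horizon, step=1.0):
    """Propose random candidate action sequences (the trajectory sampler)."""
    return [
        [(rng.uniform(-step, step), rng.uniform(-step, step)) for _ in range(horizon)]
        for _ in range(num_candidates)
    ]


def goal_distance(predicted, goal):
    """Euclidean distance between a predicted outcome and the goal."""
    return ((predicted[0] - goal[0]) ** 2 + (predicted[1] - goal[1]) ** 2) ** 0.5


def plan(state, goal, world_model, rng, num_candidates=256, horizon=4):
    """Rank sampled candidates by predicted distance to the goal; return the best."""
    candidates = sample_trajectories(rng, num_candidates, horizon)
    return min(
        candidates,
        key=lambda actions: goal_distance(world_model(state, actions), goal),
    )


rng = random.Random(0)
start, goal = (0.0, 0.0), (2.0, -1.0)
best = plan(start, goal, toy_world_model, rng)
print(goal_distance(toy_world_model(start, best), goal))
```

In this scheme the planner is only as good as the world model's predictions, which is why prediction quality and planning performance are reported together.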

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper proposes converting exocentric videos to egocentric ones via body pose extraction and a kinematics prior to train better action-conditioned world models, but the abstract asserts gains without any metrics or validation.

The key point is that this work gives a concrete pipeline to pull body poses from abundant exocentric video and warp that video into egocentric views using a human kinematics prior, then uses the resulting data to train whole-body world models that improve prediction and goal-conditioned planning. The new element is this specific exo-to-ego bridging step, which aligns action representations and unlocks in-the-wild footage for ego training. It frames the data scarcity problem cleanly and points to practical uses in robot planning and AR guidance. The approach builds on existing pose tools and priors, which keeps the method grounded and potentially reproducible if the implementation details hold up.

The soft spots sit in the evaluation. The abstract states significant improvements in prediction quality and planning performance but supplies no numbers, baselines, ablations, or error breakdowns, so it is impossible to tell whether the gains are real or tied to artifacts from the conversion. The central assumption, that the transformed videos preserve visual statistics, action alignment, and ego-specific effects like camera motion and occlusions, needs direct checks such as distribution matching against real ego data. Without those, the planning results on inferred pose sequences could reflect the synthetic proxy rather than genuine unlocking of exo video.

This paper is aimed at researchers building embodied world models who need more training data. Readers working on cross-view adaptation or data augmentation would get the most from the pipeline description. It shows clear thinking on the bottleneck even if the evidence is still thin. I would send it to peer review so the authors can add the missing quantitative validation and address the fidelity of the conversion.

Referee Report

2 major / 2 minor

Summary. The paper presents EgoExo-WM, a method that extracts structured body poses from abundant exocentric videos to represent actions and transforms the videos into egocentric views using a human kinematics prior. This converted data is then used to train whole-body action-conditioned egocentric world models, with the authors claiming significant gains in prediction quality and in downstream planning where the model infers sequences of body poses to reach a specified visual goal state.

Significance. If the conversion process faithfully preserves visual statistics and action alignment with real egocentric observations, the work could meaningfully expand training resources for egocentric world models beyond current data-scarce regimes. This would support stronger embodied prediction and planning systems with applications in robotics and augmented-reality guidance.

major comments (2)

[Video transformation section] (Description of exocentric-to-egocentric synthesis via human kinematics prior.) The central claim that converted data improves genuine egocentric world models rests on the assumption that the synthesized views match real ego statistics (hand occlusion, head-mounted motion, lighting, depth). No quantitative fidelity metrics, distribution-matching results, or ablation against real ego corpora are reported to rule out systematic artifacts from the prior.
[Experiments and results] (Prediction quality and planning sections.) The abstract asserts 'significant improvements' in prediction and planning performance, yet the manuscript must supply concrete quantitative metrics, baselines (real-ego-only models), ablations isolating the conversion step, and error analysis. Without these, the load-bearing claim that gains derive from unlocked exo data rather than proxy-task artifacts cannot be evaluated.

minor comments (2)

[Introduction] Define 'whole-body action-conditioned' and the precise action representation (pose sequences) more explicitly in the introduction to prevent reader ambiguity.
[Figures] Figure captions and pipeline diagrams should include explicit labels for the kinematics prior application and the action extraction step for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the validation of the video transformation and the experimental analysis.

Referee: [Video transformation section] (Description of exocentric-to-egocentric synthesis via human kinematics prior.) The central claim that converted data improves genuine egocentric world models rests on the assumption that the synthesized views match real ego statistics (hand occlusion, head-mounted motion, lighting, depth). No quantitative fidelity metrics, distribution-matching results, or ablation against real ego corpora are reported to rule out systematic artifacts from the prior.

Authors: We agree that quantitative fidelity metrics would provide stronger direct evidence that the synthesized views align with real egocentric statistics. The current manuscript relies on downstream task improvements as indirect validation of the kinematics prior. In the revision we will add distribution-matching results (e.g., FID or perceptual metrics) between synthesized and real egocentric videos as well as an ablation comparing models trained on converted data versus real ego corpora.
revision: yes

Referee: [Experiments and results] (Prediction quality and planning sections.) The abstract asserts 'significant improvements' in prediction and planning performance, yet the manuscript must supply concrete quantitative metrics, baselines (real-ego-only models), ablations isolating the conversion step, and error analysis. Without these, the load-bearing claim that gains derive from unlocked exo data rather than proxy-task artifacts cannot be evaluated.

Authors: The experiments section already reports quantitative prediction and planning metrics with several baselines. To make the contribution of the exo-to-ego conversion explicit, we will add (i) a real-ego-only baseline, (ii) an ablation that trains the world model with and without the converted data, and (iii) error analysis broken down by action type and prediction horizon in the revised manuscript.
revision: yes
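
The FID-style distribution matching mentioned in the first response rests on the Fréchet distance between two Gaussians fitted to feature vectors, d² = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^(1/2)). The Python sketch below is a simplified illustration, not the metric the authors would report: it assumes diagonal covariances (so the trace term reduces to squared differences of per-dimension standard deviations) and takes precomputed feature vectors as plain lists, whereas standard FID uses Inception-network features and full covariance matrices. The function names and toy features are hypothetical.

```python
from statistics import fmean, pvariance


def feature_stats(features):
    """Per-dimension mean and population variance of equal-length feature vectors."""
    columns = list(zip(*features))
    return [fmean(c) for c in columns], [pvariance(c) for c in columns]


def frechet_distance_diag(real_features, synth_features):
    """Squared Fréchet distance between diagonal-covariance Gaussian fits."""
    mu_r, var_r = feature_stats(real_features)
    mu_s, var_s = feature_stats(synth_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))
    # With diagonal covariances, Tr(S1 + S2 - 2 (S1 S2)^(1/2)) equals the
    # sum over dimensions of (sqrt(var1) - sqrt(var2))^2.
    cov_term = sum((a ** 0.5 - b ** 0.5) ** 2 for a, b in zip(var_r, var_s))
    return mean_term + cov_term


# Toy features: the "synthetic" set is the real set shifted by 1 in dimension 0.
real = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[1.0, 1.0], [3.0, 3.0]]
print(frechet_distance_diag(real, real))     # 0.0
print(frechet_distance_diag(real, shifted))  # 1.0
```

A low distance between real and synthesized egocentric features would support the fidelity assumption; it would not by itself show that action alignment is preserved.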

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on external evaluation of converted data.

The paper describes an exocentric-to-egocentric video conversion step that uses a human kinematics prior to extract body poses and synthesize ego views, then trains action-conditioned world models on the resulting data and reports measured improvements in prediction quality and planning performance. No equations, fitted parameters, or self-citations are shown that reduce the claimed gains to quantities defined by construction within the same paper; the results are presented as outcomes of training and downstream evaluation on the transformed corpus. The derivation chain therefore remains self-contained against the external benchmarks and real-world planning tasks referenced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the method implicitly depends on the existence and applicability of a human kinematics prior and on body pose being a sufficient action representation.

axioms (1)

Domain assumption: A human kinematics prior exists that can accurately map exocentric body poses and visuals into corresponding egocentric observations without introducing systematic bias for world-model training.
Basis: The abstract states that the transformation is 'informed by a human kinematics prior' and treats this step as the bridge that unlocks exo data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: wrist-position consistency objective... Lwrist

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.