pith. sign in

arxiv: 2605.26316 · v1 · pith:IL5G76PRnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI

E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

Pith reviewed 2026-06-29 22:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords egocentric video generationvideo diffusion3D memorypoint cloud conditioninghuman pose controlego-exo controlscene consistencycontrollable video synthesis
0
0 comments X

The pith

E³C generates egocentric videos by conditioning a diffusion model on rendered 3D point cloud memory while controlling ego and exo human poses separately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces E³C as a controllable video diffusion framework for egocentric generation, where the camera moves with the actor. It builds a semi-dense point cloud 3D memory from context frames, augments points with appearance descriptors from video-VAE features, and renders the memory into target viewpoints to condition the model on persistent scene structure. Human dynamics are modeled apart, with skeleton renderings for other people in the scene and 3D body joints plus 6DoF wrist motion for the camera wearer, supported by an ego motion encoder that supplies persistent cross-attention tokens when body parts are invisible. Experiments on the Nymeria dataset show gains in visual fidelity, camera motion accuracy, object consistency, and human control over baselines, plus support for scene editing. A sympathetic reader would care because accurate first-person video synthesis helps embodied agents reason about how their actions and others' actions change the surrounding world.

Core claim

E³C constructs a semi-dense point cloud-based 3D memory from context frames, with each point augmented by appearance descriptors from video-VAE features. Rendering this memory into target viewpoints supplies conditioning signals aligned with the desired frames. Ego human control is specified through 3D body joints and wrist motion, with an ego motion encoder creating persistent tokens to maintain control during occlusions, while exo human control uses skeleton renderings. This separation of persistent scene structure from human-driven dynamics produces higher visual fidelity, more accurate camera motion and object consistency, stronger ego and exo pose control, and intuitive scene editing on

What carries the argument

The semi-dense point cloud-based 3D memory, rendered into target viewpoints to provide conditioning that disentangles persistent scene structure from human dynamics.

If this is right

  • Generated videos exhibit improved object consistency and visual fidelity across frames with large viewpoint shifts.
  • Camera motion accuracy and both ego and exo human pose control increase relative to baselines that lack separate memory and control pathways.
  • Scene editing becomes feasible by direct modification of the underlying 3D memory points before rendering.
  • Ego control remains effective even when the camera wearer's body parts leave the visible frame.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Incremental updates to the point cloud memory could support generation of video sequences longer than the initial context window.
  • The same memory-plus-pose separation might transfer to predicting future views for navigation or planning in robotic systems.
  • Multiple independent ego and exo control signals could extend the method to multi-person interaction videos.

Load-bearing premise

The semi-dense point cloud 3D memory built from context frames, when rendered to new viewpoints, supplies conditioning signals that preserve persistent scene structure despite rapid viewpoint changes, self-occlusions, and subtle articulated actions.

What would settle it

Generated video sequences that display drifting object locations or mismatched appearances in regions of rapid camera motion around self-occluded body parts, when compared frame-by-frame against ground-truth Nymeria recordings.

Figures

Figures reproduced from arXiv: 2605.26316 by Adam W Harley, Florian Shkurti, Julian Straub, Lingni Ma, Qiao Gu, Richard Newcombe.

Figure 1
Figure 1. Figure 1: We present E ³ C , a controllable egocentric video generation model that unifies 3D Environmental memory with Ego– Exo human pose Control. E ³ C generates ego￾centric videos that maintain 3D scene consistency (top left), follow ego-body motion (top right), and adhere to exo-human motion (bottom). All images are from generated videos. Project page: e3c-videogen.github.io . Abstract. Controllable and physica… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of E³C. A latent video DiT iteratively denoises noisy video latents conditioned on a text prompt, view-aligned renderings of a semi-dense point cloud￾based 3D memory augmented with VAE appearance features, and ego–exo human pose controls. The rendered memory is injected via the context adapter, while pose controls are provided both as pose drawings in the conditioning video and as persistent ego p… view at source ↗
Figure 3
Figure 3. Figure 3: Ego pose encoder. The encoder maps ego 3D body joints p e,s and 6DoF wrist poses p e,w to motion-aware tokens He using spatial and temporal transformers and gated fusion. He is then used for pose cross-attention ( [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results. Each row corresponds to one video snippet and columns show different timestamps. In each cell, the top row shows the conditioning visualization (left) and the ground-truth frame (right), and the bottom row shows the generated frame from E³C. E³C maintains consistent 3D scenes (row 1), follows ego-body motion (row 2), and adheres to exo-human motion (rows 3–4). See full videos on the pr… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative ablations. Each row shows the GT frame and the corresponding pre￾dictions with and without each type of condition. Top: removing per-point appearance features (Feat) increases texture and color drift. Middle: removing exo pose control (Exo) harms exo-motion adherence. Bottom: removing encoded ego pose tokens (Ego) reduces ego hand/body fidelity under self-occlusion. encoder branch, and uses the… view at source ↗
Figure 6
Figure 6. Figure 6: Editing by manipulating explicit 3D memory and pose conditions. Left four columns: object removal by deleting the object’s points from the 3D memory, which consistently removes it from generated frames. Right four columns: exo-person removal by removing the corresponding pose control. Both conditioning and generated frames before and after the edits are shown. See full videos on the project page. while fol… view at source ↗
read the original abstract

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E$^3$C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E$^3$C improves visual fidelity, camera-motion accuracy, object consistency, and ego & exo human control over strong baselines, while also enabling intuitive scene editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents E³C, a controllable video diffusion framework for egocentric video generation. From context frames it constructs a semi-dense point cloud-based 3D memory augmented with video-VAE appearance descriptors; this memory is rendered into target viewpoints to supply conditioning signals. Human dynamics are handled separately via skeleton renderings (exo control) and 3D body joints plus 6DoF wrist motion (ego control), with an additional ego motion encoder producing persistent cross-attention tokens for cases where the wearer's body is occluded. Experiments on the Nymeria dataset are reported to show gains in visual fidelity, camera-motion accuracy, object consistency, and ego/exo human control relative to strong baselines, together with the ability to perform intuitive scene editing.

Significance. If the empirical claims hold, the work offers a practical disentanglement of persistent 3D scene structure from articulated human dynamics, directly addressing the rapid viewpoint changes, self-occlusions, and partial visibility that characterize egocentric video. The combination of rendered 3D memory with separate pose controls and the editing capability would be a useful contribution to controllable video synthesis for embodied agents.

major comments (1)
  1. [Abstract] Abstract: the central empirical claim—that E³C improves visual fidelity, camera-motion accuracy, object consistency, and ego/exo human control on Nymeria—is stated without any quantitative numbers, metric definitions, baseline descriptions, error bars, dataset statistics, or implementation details. This absence prevents verification that the reported gains are supported by the method or data.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the abstract and for the positive overall assessment of the work. We address the point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim—that E³C improves visual fidelity, camera-motion accuracy, object consistency, and ego/exo human control on Nymeria—is stated without any quantitative numbers, metric definitions, baseline descriptions, error bars, dataset statistics, or implementation details. This absence prevents verification that the reported gains are supported by the method or data.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised version we will expand the final sentence of the abstract to report the key numerical improvements (e.g., relative gains in FID, camera-motion error, object-consistency metrics, and control accuracy) together with the names of the primary baselines, the main evaluation metrics, and a brief note on the Nymeria split used. Error bars and full implementation details will remain in the experimental section, as is conventional, but the abstract will now contain enough concrete numbers to allow immediate verification of the stated gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes a video diffusion model that builds a semi-dense 3D point-cloud memory from context frames, renders it for target views, and applies separate skeleton-based controls for exo and ego human motion. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs themselves. The central claims rest on empirical results on the Nymeria dataset rather than any self-referential derivation or self-citation chain. The provided text contains no load-bearing self-citations, uniqueness theorems, or ansatzes that collapse the method onto its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, fitting procedures, or background assumptions; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5811 in / 1193 out tokens · 36699 ms · 2026-06-29T22:22:02.877725+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

82 extracted references · 34 canonical work pages · 14 internal anchors

  1. [1]

    Gated Multimodal Units for Information Fusion

    Arevalo, J., Solorio, T., Montes-y Gómez, M., González, F.A.: Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992 (2017) 8, 24

  2. [2]

    Layer Normalization

    Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016) 23

  3. [3]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera con- trol in video diffusion transformers. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22875–22889 (2025) 4

  4. [4]

    Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Vd3d: Taming large video diffusion transformers for 3d camera control. Proc. ICLR (2025) 4

  5. [5]

    arXiv preprint arXiv:2506.21552 (2025) 2, 4, 9, 11, 13, 26

    Bai, Y., Tran, D., Bar, A., LeCun, Y., Darrell, T., Malik, J.: Whole-body condi- tioned egocentric video prediction. arXiv preprint arXiv:2506.21552 (2025) 2, 4, 9, 11, 13, 26

  6. [6]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Bar, A., Zhou, G., Tran, D., Darrell, T., LeCun, Y.: Navigation world models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15791–15801 (2025) 26 16 Q. Gu et al

  7. [7]

    In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers

    Cao, C., Zhou, J., Li, S., Liang, J., Yu, C., Wang, F., Xue, X., Fu, Y.: Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video gen- eration. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–12 (2025) 2, 4, 7, 8

  8. [8]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025) 10, 27

  9. [9]

    In: European conference on computer vision

    Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End- to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020) 8, 24

  10. [10]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chou, G., Zhang, K., Bi, S., Tan, H., Xu, Z., Luan, F., Hariharan, B., Snavely, N.: Generating 3d-consistent videos from unposed internet photos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27934–27945 (2025) 4

  11. [11]

    arXiv preprint arXiv:2511.18991 (2025) 4

    Danier, D., Gao, G., McDonagh, S., Li, C., Bilen, H., Mac Aodha, O.: View- consistent diffusion representations for 3d-consistent video generation. arXiv preprint arXiv:2511.18991 (2025) 4

  12. [12]

    VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

    Du, H., Ye, J., Cong, X., Li, R., Ni, J., Agarwal, A., Zhou, Z., Li, Z., Balestriero, R., Wang, Y.: Videogpa: Distilling geometry priors for 3d-consistent video generation. arXiv preprint arXiv:2601.23286 (2026) 4

  13. [13]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    Engel, J., Somasundaram, K., Goesele, M., Sun, A., Gamino, A., Turner, A., Ta- lattof, A., Yuan, A., Souti, B., Meredith, B., Peng, C., Sweeney, C., Wilson, C., Barnes, D., DeTone, D., Caruso, D., Valleroy, D., Ginjupalli, D., Frost, D., Miller, E., Mueggler, E., Oleinik, E., Zhang, F., Somasundaram, G., Solaira, G., Lanaras, H., Howard-Jenkins, H., Tang,...

  14. [14]

    In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Escobar, M., Puentes, J., Forigua, C., Pont-Tuset, J., Maninis, K.K., Arbelaez, P.: Egocast: Forecasting egocentric human pose in the wild. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 5831–5841. IEEE (2025) 2

  15. [15]

    In: Forty-first international conference on machine learning (2024) 5

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 5

  16. [16]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

    Fang, I., Chen, Y., Wang, Y., Zhang, J., Zhang, Q., Xu, J., He, X., Gao, W., Su, H., Li, Y., et al.: Egopat3dv2: Predicting 3d action target from 2d egocentric vision for human-robot interaction. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 3036–3043. IEEE (2024) 2

  17. [17]

    In: 2017 IEEE international conference on robotics and automation (ICRA)

    Finn, C., Levine, S.: Deep visual foresight for planning robot motion. In: 2017 IEEE international conference on robotics and automation (ICRA). pp. 2786–2793. IEEE (2017) 2

  18. [18]

    E³C: Video Generation with 3D Environmental Memory 17 In: Advances in Neural Information Processing Systems

    Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dream- sim: Learning new dimensions of human visual similarity using synthetic data. E³C: Video Generation with 3D Environmental Memory 17 In: Advances in Neural Information Processing Systems. vol. 36, pp. 50742–50768 (2023) 13

  19. [19]

    YOLOX: Exceeding YOLO Series in 2021

    Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021) 9, 10

  20. [20]

    In: European Conference on Computer Vision

    Gu, Q., Lv, Z., Frost, D., Green, S., Straub, J., Sweeney, C.: Egolifter: Open-world 3d segmentation for egocentric perception. In: European Conference on Computer Vision. pp. 382–400. Springer (2024) 9

  21. [21]

    In: 2025 International Conference on 3D Vision (3DV)

    Guzov, V., Jiang, Y., Hong, F., Pons-Moll, G., Newcombe, R., Liu, C.K., Ye, Y., Ma, L.: Hmd 2: Environment-aware motion generation from single egocentric head-mounted device. In: 2025 International Conference on 3D Vision (3DV). pp. 1394–1405. IEEE (2025) 9

  22. [22]

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: En- ablingcameracontrolfortext-to-videogeneration.arXivpreprintarXiv:2404.02101 (2024) 2, 4, 10

  23. [23]

    Advances in neural information processing systems30(2017) 13

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017) 13

  24. [24]

    Hou,C.,Chen,Z.:Training-freecameracontrolforvideogeneration.arXivpreprint arXiv:2406.10126 (2024) 2, 4

  25. [25]

    ICLR1(2), 3 (2022) 10

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022) 10

  26. [26]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Huang, J., Zhou, Q., Rabeti, H., Korovko, A., Ling, H., Ren, X., Shen, T., Gao, J., Slepichev, D., Lin, C.H., Ren, J., Xie, K., Biswas, J., Leal-Taixe, L., Fidler, S.: Vipe: Video pose engine for 3d geometric perception. In: NVIDIA Research Whitepapers arXiv:2508.10934 (2025) 10, 24, 27

  27. [27]

    ACM Transactions on Graphics (TOG)44(6), 1–15 (2025) 2, 4

    Huang, T., Zheng, W., Wang, T., Liu, Y., Wang, Z., Wu, J., Jiang, J., Li, H., Lau, R., Zuo, W., et al.: Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. ACM Transactions on Graphics (TOG)44(6), 1–15 (2025) 2, 4

  28. [28]

    VACE: All-in-One Video Creation and Editing

    Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598 (2025) 3, 5, 6, 7, 8, 9, 10, 11, 21, 22

  29. [29]

    arXiv preprint arXiv:2512.08269 (2025) 5

    Kang, T., Kim, K., Kim, D., Park, M., Hyung, J., Choo, J.: Egox: Egocentric video generation from a single exocentric video. arXiv preprint arXiv:2512.08269 (2025) 5

  30. [30]

    ACM Transactions on Graphics (ToG)42(4), 1–14 (2023) 26

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG)42(4), 1–14 (2023) 26

  31. [31]

    In: Interna- tional Conference on Learning Representations (2015) 10

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Interna- tional Conference on Learning Representations (2015) 10

  32. [32]

    In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

    Koh, J.Y., Lee, H., Yang, Y., Baldridge, J., Anderson, P.: Pathdreamer: A world model for indoor navigation. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 14738–14748 (2021) 2

  33. [33]

    Naval research logistics quarterly2(1-2), 83–97 (1955) 10

    Kuhn, H.W.: The hungarian method for the assignment problem. Naval research logistics quarterly2(1-2), 83–97 (1955) 10

  34. [34]

    arXiv preprint arXiv:2510.14945 (2025) 4 18 Q

    Lee, J., Jung, J., Han, J., Narihira, T., Fukuda, K., Seo, J., Hong, S., Mitsufuji, Y., Kim, S.: 3d scene prompting for scene-consistent camera-controllable video generation. arXiv preprint arXiv:2510.14945 (2025) 4 18 Q. Gu et al

  35. [35]

    In: Interna- tional conference on machine learning

    Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., Teh, Y.W.: Set transformer: A framework for attention-based permutation-invariant neural networks. In: Interna- tional conference on machine learning. pp. 3744–3753. PMLR (2019) 8, 24

  36. [36]

    arXiv preprint arXiv:2506.18903 (2025) 2, 4, 9, 10, 11, 21, 25

    Li, R., Torr, P., Vedaldi, A., Jakab, T.: Vmem: Consistent interactive video scene generation with surfel-indexed view memory. arXiv preprint arXiv:2506.18903 (2025) 2, 4, 9, 10, 11, 21, 25

  37. [37]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, T., Zheng, G., Jiang, R., Zhan, S., Wu, T., Lu, Y., Lin, Y., Deng, C., Xiong, Y., Chen, M., et al.: Realcam-i2v: Real-world image-to-video generation with in- teractive complex camera control. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 28785–28796 (2025) 4

  38. [38]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) 5, 10

  39. [39]

    Liu, J.W., Mao, W., Xu, Z., Keppo, J., Shou, M.Z.: Exocentric-to-egocentric video generation.AdvancesinNeuralInformationProcessingSystems37,136149–136172 (2024) 5

  40. [40]

    arXiv preprint arXiv:2403.09194 (2024) 5

    Luo, H., Zhu, K., Zhai, W., Cao, Y.: Intention-driven ego-to-exo video generation. arXiv preprint arXiv:2403.09194 (2024) 5

  41. [41]

    In: European Conference on Computer Vision

    Luo, M., Xue, Z., Dimakis, A., Grauman, K.: Put myself in your shoes: Lifting the egocentric perspective from exocentric videos. In: European Conference on Computer Vision. pp. 407–425. Springer (2024) 5

  42. [42]

    In: European Conference on Computer Vision

    Ma, L., Ye, Y., Hong, F., Guzov, V., Jiang, Y., Postyeni, R., Pesqueira, L., Gamino, A., Baiyya, V., Kim, H.J., et al.: Nymeria: A massive collection of multimodal egocentric daily motion in the wild. In: European Conference on Computer Vision. pp. 445–465. Springer (2024) 3, 9

  43. [43]

    arXiv preprint arXiv:2511.20186 (2025) 5

    Mahdi, M., Fu, Y., Savov, N., Pan, J., Paudel, D.P., Van Gool, L.: Exo2egosyn: Unlocking foundation video generation models for exocentric-to-egocentric video synthesis. arXiv preprint arXiv:2511.20186 (2025) 5

  44. [44]

    Meta Reality Labs Research: Project aria machine perception services.https: //facebookresearch.github.io/projectaria_tools/docs/ARK/mps(2023), ac- cessed: 2026-01-27 9, 10, 21, 24, 26

  45. [45]

    Advances in Neural Information Processing Systems36, 72983–73007 (2023) 10, 27

    Minderer, M., Gritsenko, A., Houlsby, N.: Scaling open-vocabulary object detec- tion. Advances in Neural Information Processing Systems36, 72983–73007 (2023) 10, 27

  46. [46]

    movella.com/products/motion-capture/xsens-mvn-link(2023), accessed: 2026- 01-27 9

    Movella: Movella Xsens MVN Link Motion Capture System.https : / / www . movella.com/products/motion-capture/xsens-mvn-link(2023), accessed: 2026- 01-27 9

  47. [47]

    nerf.studio/en/latest/nerfology/methods/splatfacto.html(2023), part of the Nerfstudio framework 9, 10, 11, 21, 26

    Nerfstudio Team: Splatfacto: Splatting for fast radiance fields.https://docs. nerf.studio/en/latest/nerfology/methods/splatfacto.html(2023), part of the Nerfstudio framework 9, 10, 11, 21, 26

  48. [48]

    arXiv preprint arXiv:2511.18173 (2025) 2, 4, 9, 10, 11, 13, 26

    Pallotta, E., Azar, S.M., Doorenbos, L., Ozsoy, S., Iqbal, U., Gall, J.: Egocontrol: Controllable egocentric video generation via 3d full-body poses. arXiv preprint arXiv:2511.18173 (2025) 2, 4, 9, 10, 11, 13, 26

  49. [49]

    Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 5, 6

  50. [50]

    arXiv preprint arXiv:2506.09042 (2025) 2 E³C: Video Generation with 3D Environmental Memory 19

    Ren, X., Lu, Y., Cao, T., Gao, R., Huang, S., Sabour, A., Shen, T., Pfaff, T., Wu, J.Z., Chen, R., et al.: Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models. arXiv preprint arXiv:2506.09042 (2025) 2 E³C: Video Generation with 3D Environmental Memory 19

  51. [51]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video gener- ation with precise camera control. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6121–6132 (2025) 2, 4, 7, 9, 10, 11, 21, 24

  52. [52]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Shuai, X., Ding, H., Qin, Z., Luo, H., Ma, X., Tao, D.: Free-form motion control: Controlling the 6d poses of camera and objects in video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12449–12458 (2025) 4

  53. [53]

    Highway Networks

    Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. arXiv preprint arXiv:1505.00387 (2015) 8, 24

  54. [54]

    arXiv preprint arXiv:2506.09995 (2025)

    Tu, Y., Luo, H., Chen, X., Bai, X., Wang, F., Zhao, H.: Playerone: Egocentric world simulator. arXiv preprint arXiv:2506.09995 (2025) 2

  55. [55]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018) 10

  56. [56]

    Advances in neural information pro- cessing systems30(2017) 8, 23, 24

    Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems30(2017) 8, 23, 24

  57. [57]

    In: European Conference on Computer Vision

    Voleti, V., Yao, C.H., Boss, M., Letts, A., Pankratz, D., Tochilkin, D., Laforte, C., Rombach, R., Jampani, V.: Sv3d: Novel multi-view synthesis and 3d genera- tion from a single image using latent video diffusion. In: European Conference on Computer Vision. pp. 439–457. Springer (2024) 4

  58. [58]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025) 3, 5, 10, 22

  59. [59]

    arXiv preprint arXiv:2510.01183 (2025) 4

    Wang, J., Ye, L., Lu, T., Xiao, J., Zhang, J., Guo, Y., Liu, X., Chellappa, R., Peng, C., Yuille, A., et al.: Evoworld: Evolving panoramic world generation with explicit 3d memory. arXiv preprint arXiv:2510.01183 (2025) 4

  60. [60]

    arXiv preprint arXiv:2411.08380 (2024) 4

    Wang,X.,Zhao,K.,Liu,F.,Wang,J.,Zhao,G.,Bao,X.,Zhu,Z.,Zhang,Y.,Wang, X.: Egovid-5m: A large-scale video-action dataset for egocentric video generation. arXiv preprint arXiv:2411.08380 (2024) 4

  61. [61]

    IEEE TIP13(4), 600–612 (2004) 10

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP13(4), 600–612 (2004) 10

  62. [62]

    In: ACM SIGGRAPH 2024 Conference Papers

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Mo- tionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024) 2, 4

  63. [63]

    In: The Thirteenth International Conference on Learn- ing Representations, ICLR 2025, Singapore, April 24-28, 2025

    Watson, D., Saxena, S., Li, L., Tagliasacchi, A., Fleet, D.J.: Controlling space and time with diffusion models. In: The Thirteenth International Conference on Learn- ing Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net (2025),https://openreview.net/forum?id=d2UrCGtntF2

  64. [64]

    arXiv preprint arXiv:2506.05284 (2025) 2, 4, 7

    Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284 (2025) 2, 4, 7

  65. [65]

    arXiv preprint arXiv:2407.17470 (2024) 4

    Xie, Y., Yao, C.H., Voleti, V., Jiang, H., Jampani, V.: Sv4d: Dynamic 3d con- tent generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470 (2024) 4

  66. [66]

    arXiv preprint arXiv:2508.13013 (2025) 2, 4, 9, 10, 11, 13 20 Q

    Xiu, J., Hong, F., Li, Y., Li, M., Wang, W., Han, S., Pan, L., Liu, Z.: Egotwin: Dreaming body and view in first person. arXiv preprint arXiv:2508.13013 (2025) 2, 4, 9, 10, 11, 13 20 Q. Gu et al

  67. [67]

    CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    Xu, D., Nie, W., Liu, C., Liu, S., Kautz, J., Wang, Z., Vahdat, A.: Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509 (2024) 4

  68. [68]

    arXiv preprint arXiv:2504.11732 (2025) 5

    Xu, J., Huang, Y., Pei, B., Hou, J., Li, Q., Chen, G., Zhang, Y., Feng, R., Xie, W.: Egoexo-gen: Ego-centric video prediction by watching exo-centric videos. arXiv preprint arXiv:2504.11732 (2025) 5

  69. [69]

    In: Advances in Neural Information Processing Systems (2022) 9, 10

    Xu,Y.,Zhang,J.,Zhang,Q.,Tao,D.:ViTPose:Simplevisiontransformerbaselines for human pose estimation. In: Advances in Neural Information Processing Systems (2022) 9, 10

  70. [70]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yang, J., Liu, S., Guo, H., Dong, Y., Zhang, X., Zhang, S., Wang, P., Zhou, Z., Xie, B., Wang, Z., et al.: Egolife: Towards egocentric life assistant. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28885–28900 (2025) 2

  71. [71]

    In: ACM SIGGRAPH 2024 Conference Papers

    Yang, S., Hou, L., Huang, H., Ma, C., Wan, P., Zhang, D., Chen, X., Liao, J.: Direct-a-video: Customized video generation with user-directed camera movement and object motion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–12 (2024) 2, 4

  72. [72]

    In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers

    Yu, J., Bai, J., Qin, Y., Liu, Q., Wang, X., Wan, P., Zhang, D., Liu, X.: Context as memory: Scene-consistent interactive long video generation with memory retrieval. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–11 (2025) 2

  73. [73]

    In: The Thirteenth International Conference on Learning Representations (2025) 2

    Yu, W., Yin, S., Easterbrook, S., Garg, A.: Egosim: Egocentric exploration in virtual worlds with multi-modal conditioning. In: The Thirteenth International Conference on Learning Representations (2025) 2

  74. [74]

    Zhang, H., Chen, X., Wang, Y., Liu, X., Wang, Y., Qiao, Y.: 4diffusion: Multi-view videodiffusionmodelfor4dgeneration.AdvancesinNeuralInformationProcessing Systems37, 15272–15295 (2024) 4

  75. [75]

    arXiv preprint arXiv:2512.04515 (2025) 4

    Zhang, L., Ye, J., Wang, Y., Zhong, M., Cao, M., Xia, W., Zeng, B., Zhang, Z., Tang, H.: Egolcd: Egocentric video generation with long context diffusion. arXiv preprint arXiv:2512.04515 (2025) 4

  76. [76]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023) 8

  77. [77]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Zhang, Q., Zhai, S., Martin, M.A.B., Miao, K., Toshev, A., Susskind, J., Gu, J.: World-consistent video diffusion with explicit 3d modeling. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21685–21695 (2025) 4

  78. [78]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 10, 13

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 10, 13

  79. [79]

    arXiv preprint arXiv:2512.15716 (2025) 4

    Zhao, J., Wei, F., Liu, Z., Zhang, H., Xu, C., Lu, Y.: Spatia: Video generation with updatable spatial memory. arXiv preprint arXiv:2512.15716 (2025) 4

  80. [80]

    arXiv preprint arXiv:2410.15957 (2024) 4

    Zheng, G., Li, T., Jiang, R., Lu, Y., Wu, T., Li, X.: Cami2v: Camera-controlled image-to-video diffusion model. arXiv preprint arXiv:2410.15957 (2024) 4

Showing first 80 references.