
arxiv: 2606.18558 · v1 · pith:H7JPKOUHnew · submitted 2026-06-17 · 💻 cs.CV

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Pith reviewed 2026-06-26 21:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D point trajectories · motion forecasting · language-conditioned prediction · goal-conditioned forecasting · robot manipulation transfer · video motion guidance · point motion benchmark

The pith

A model forecasts future 3D trajectories of object points given short video history and language goal descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that 3D points in world coordinates offer a class-agnostic and view-stable way to represent object motion. It defines goal-conditioned 3D point motion forecasting as the task of predicting future point positions from visual history, selected query points, and a language instruction. To support the task it releases a dataset of 1.16 million annotated trajectories and a benchmark covering 111 object categories. The resulting model, which can generate trajectories either autoregressively or via flow matching, beats prior motion predictors on the benchmark. The same 3D motion prior also speeds up robot manipulation training and supplies motion cues that improve realism in generated videos.

Core claim

Given a short visual history, a set of 3D query points on an object, and a language description of the intended goal, MolmoMotion predicts the future 3D trajectory of each point; the learned prior transfers to robot manipulation by improving training efficiency and generalization and supplies motion guidance that lets generative models produce videos with more realistic object motion.

What carries the argument

Goal-conditioned 3D point motion forecasting that maps visual history plus language goal to future point trajectories via either autoregressive coordinate prediction or flow-matching generation.
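
The two decoding modes named above can be contrasted with a minimal sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names, the constant-velocity step, and the "oracle" velocity field are hypothetical stand-ins for a trained network, and visual and language conditioning are omitted. The autoregressive rollout emits one future position at a time and feeds it back as context. The flow-matching sampler moves a whole noisy trajectory toward a plausible one by Euler-integrating a velocity field from t = 0 to t = 1. The oracle field assumes the straight-line interpolation path x_t = (1 - t) x_0 + t x_1, for which the velocity is (x_1 - x_t) / (1 - t).

```python
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[Point3D]  # positions of one query point over time


def rollout_autoregressive(
    history: Trajectory,
    next_point: Callable[[Trajectory], Point3D],
    horizon: int,
) -> Trajectory:
    """Predict `horizon` future positions one step at a time,
    feeding each prediction back in as context for the next."""
    context = list(history)
    future: Trajectory = []
    for _ in range(horizon):
        p = next_point(context)
        future.append(p)
        context.append(p)
    return future


def sample_flow_matching(
    noise: Trajectory,
    velocity: Callable[[Trajectory, float], Trajectory],
    num_steps: int,
) -> Trajectory:
    """Euler-integrate dx/dt = velocity(x, t) from t=0 to t=1,
    updating every time step of the trajectory jointly."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [
            tuple(xi + dt * vi for xi, vi in zip(p, vp))
            for p, vp in zip(x, v)
        ]
    return x


# --- Hypothetical stand-ins for a trained model (illustration only) ---
def constant_velocity_step(context: Trajectory) -> Point3D:
    """Extrapolate the last observed displacement."""
    (x0, y0, z0), (x1, y1, z1) = context[-2], context[-1]
    return (2 * x1 - x0, 2 * y1 - y0, 2 * z1 - z0)


def make_oracle_velocity(target: Trajectory):
    """Velocity field of the straight-line path that ends at `target`."""
    def velocity(x: Trajectory, t: float) -> Trajectory:
        scale = 1.0 / (1.0 - t)
        return [
            tuple(scale * (ti - xi) for ti, xi in zip(tp, p))
            for tp, p in zip(target, x)
        ]
    return velocity


history = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05)]
ar_future = rollout_autoregressive(history, constant_velocity_step, horizon=3)

target = [(0.2, 0.0, 0.1), (0.3, 0.0, 0.15), (0.4, 0.0, 0.2)]
noise = [(1.0, -1.0, 0.5), (0.0, 2.0, -0.3), (-0.7, 0.4, 0.9)]
fm_future = sample_flow_matching(noise, make_oracle_velocity(target), num_steps=10)
```

In this toy setup both modes land on the same three future positions up to floating-point error. With a learned model the practical difference is that the autoregressive path commits to each step before predicting the next, while the flow-matching path refines all steps together.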

If this is right

  • The model outperforms existing motion prediction baselines on PointMotionBench across 111 object categories and 61 motion types.
  • The learned 3D motion prior raises training efficiency and generalization when applied to robot manipulation policies.
  • Predicted trajectories supply motion guidance that lets video generative models synthesize sequences with more realistic object motion.
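
The benchmark claim in the first bullet depends on a trajectory error metric, which this page does not name. As an assumption, the sketch below uses average and final displacement error (ADE and FDE), which are common in trajectory forecasting; whether PointMotionBench reports these exact quantities is not confirmed by the source. The function name and data layout are hypothetical.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Tracks = List[List[Point3D]]  # indexed as [query_point][time_step]


def displacement_errors(pred: Tracks, gt: Tracks) -> Tuple[float, float]:
    """Return (ADE, FDE) in the units of the world coordinates.

    ADE: mean Euclidean error over all points and all future steps.
    FDE: mean Euclidean error over points at the final step only.
    """
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("pred and gt must have matching shapes")
    per_step = [
        [math.dist(a, b) for a, b in zip(p, g)]
        for p, g in zip(pred, gt)
    ]
    total_steps = sum(len(row) for row in per_step)
    ade = sum(sum(row) for row in per_step) / total_steps
    fde = sum(row[-1] for row in per_step) / len(per_step)
    return ade, fde


# Two query points, two future steps each (made-up numbers).
gt = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
pred = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],  # errors: 0, 2
    [(3.0, 4.0, 0.0), (0.0, 1.0, 0.0)],  # errors: 5, 0
]
ade, fde = displacement_errors(pred, gt)  # ade = 1.75, fde = 1.0
```

ADE rewards staying close along the whole path, while FDE only checks where each point ends up, so a goal-conditioned forecaster can score well on one and poorly on the other.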

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same point-based representation could support planning in multi-object scenes once interaction terms are added.
  • Because the output is a set of 3D trajectories rather than pixel flow, it may integrate directly with physics simulators for verification.
  • Scaling the dataset further could allow zero-shot transfer to novel motion types beyond the 61 covered by the current benchmark.

Load-bearing premise

3D points extracted from unconstrained videos remain a sufficiently general and view-stable representation for forecasting and for the claimed transfers to manipulation and video synthesis without extra object-specific modeling.
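
What "view-stable" means, and why it depends on pose accuracy, can be shown with a small geometric sketch. It is a toy illustration, not the paper's pipeline: the camera poses, the helper names, and the 2-degree pose error are all assumptions. The same world-space trajectory looks different in each camera's own frame, but lifting each observation back with the correct camera-to-world pose recovers identical world coordinates. Lifting with a slightly wrong pose does not.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]


def rot_z(angle: float) -> Mat3:
    """Rotation about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def cam_to_world(p: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """p_world = R @ p_cam + t, with (R, t) the camera-to-world pose."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def world_to_cam(p: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """p_cam = R^T @ (p_world - t)."""
    d = [p[i] - t[i] for i in range(3)]
    return tuple(sum(R[j][i] * d[j] for j in range(3)) for i in range(3))


def max_gap(a: List[Vec3], b: List[Vec3]) -> float:
    """Largest pointwise Euclidean distance between two trajectories."""
    return max(math.dist(p, q) for p, q in zip(a, b))


world_traj = [(1.0, 0.0, 0.5), (1.1, 0.1, 0.5), (1.2, 0.3, 0.6)]

# Two hypothetical camera-to-world poses.
R_a, t_a = rot_z(0.0), (0.0, 0.0, 0.0)
R_b, t_b = rot_z(math.pi / 2), (2.0, 0.0, 0.0)

# What each camera observes in its own frame.
obs_a = [world_to_cam(p, R_a, t_a) for p in world_traj]
obs_b = [world_to_cam(p, R_b, t_b) for p in world_traj]
camera_frame_gap = max_gap(obs_a, obs_b)  # large: views disagree

# Lifting with the correct poses recovers the same world trajectory.
lift_a = [cam_to_world(p, R_a, t_a) for p in obs_a]
lift_b = [cam_to_world(p, R_b, t_b) for p in obs_b]
world_frame_gap = max_gap(lift_a, lift_b)  # ~0: view-stable

# Lifting with a pose that is off by 2 degrees breaks the agreement.
R_b_wrong = rot_z(math.pi / 2 + math.radians(2.0))
lift_b_wrong = [cam_to_world(p, R_b_wrong, t_b) for p in obs_b]
pose_error_drift = max_gap(lift_a, lift_b_wrong)
```

The premise therefore holds exactly only when depth and pose are known exactly; with estimated poses, the residual drift is the quantity that would need to be measured.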

What would settle it

A controlled test in which predicted trajectories produce no measurable gain in robot task success rate or no reduction in motion artifacts in synthesized videos would show the representation does not transfer as claimed.

Original abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the task of goal-conditioned 3D point motion forecasting, where a model predicts future 3D trajectories of query points on an object given visual history and a language goal description. It presents MolmoMotion-1M, a corpus of 1.16M video-derived 3D point trajectories with action descriptions; PointMotionBench, a human-verified benchmark across 111 categories and 61 motion types; and MolmoMotion, a model supporting autoregressive and flow-matching trajectory prediction. The work claims that MolmoMotion significantly outperforms motion prediction baselines on the benchmark and that the learned 3D motion prior transfers to improve robot manipulation training and to guide more realistic object motion in video synthesis.

Significance. If the quantitative claims hold with proper controls, the work would establish 3D point trajectories as a compact, class-agnostic representation for scalable motion forecasting, potentially benefiting robotics and generative video models by providing a view-stable prior learned from large-scale video data.

major comments (2)
  1. [Dataset and benchmark sections] Dataset construction (MolmoMotion-1M): The central claim that 3D points in world coordinates extracted from unconstrained videos form a view-stable, general representation is load-bearing for both the forecasting task and the downstream transfer results, yet the manuscript provides no quantification of depth/pose estimation errors or experiments demonstrating invariance under viewpoint changes. Systematic biases in the 1.16M-video corpus would directly affect PointMotionBench scores and the reported robot/video gains.
  2. [Experiments section] Experiments and results: The abstract asserts significant outperformance over baselines and successful transfer, but the provided text contains no tables, metrics, error bars, ablations, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed, undermining the soundness of the central claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our dataset construction and experimental presentation. We address each major comment below and commit to revisions that will strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [Dataset and benchmark sections] Dataset construction (MolmoMotion-1M): The central claim that 3D points in world coordinates extracted from unconstrained videos form a view-stable, general representation is load-bearing for both the forecasting task and the downstream transfer results, yet the manuscript provides no quantification of depth/pose estimation errors or experiments demonstrating invariance under viewpoint changes. Systematic biases in the 1.16M-video corpus would directly affect PointMotionBench scores and the reported robot/video gains.

    Authors: We agree this is a substantive gap. While the 3D point representation is derived from standard monocular reconstruction pipelines, the manuscript does not include explicit error quantification or viewpoint-invariance tests. In the revision we will add a dedicated analysis subsection that reports depth/pose error statistics on held-out sequences with available ground truth, plus controlled experiments that re-render the same trajectories from novel viewpoints to measure consistency. These additions will directly support the view-stability claim and allow readers to assess potential corpus biases. revision: yes

  2. Referee: [Experiments section] Experiments and results: The abstract asserts significant outperformance over baselines and successful transfer, but the provided text contains no tables, metrics, error bars, ablations, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed, undermining the soundness of the central claims.

    Authors: The full manuscript contains a complete Experiments section (Section 4) with quantitative tables, per-category metrics, error bars from repeated runs, ablation studies, and statistical significance tests. If these elements were not visible in the review copy, we will reformat and prominently place all tables, figures, and statistical details in the revised submission so that the magnitude and reliability of the reported gains are fully transparent and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: model, dataset, and benchmark are externally constructed and evaluated

Full rationale

The paper defines a new forecasting task, constructs MolmoMotion-1M from 1.16M external videos, introduces a human-verified PointMotionBench spanning 111 categories, and trains MolmoMotion to outperform baselines on it. No equations, parameter-fitting steps, or self-citation chains are described that would reduce any claimed prediction to an input by construction. The 3D point representation is presented as an argued modeling choice whose utility is tested empirically rather than assumed via self-reference. The derivation chain therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. All modeling choices, loss functions, and annotation assumptions remain unstated.

pith-pipeline@v0.9.1-grok · 5841 in / 1275 out tokens · 18602 ms · 2026-06-26T21:26:01.630731+00:00 · methodology

