MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Chenhao Zheng; Chun-Liang Li; Jiafei Duan; Jianing Zhang; Jieyu Zhang; Max Argus; Ranjay Krishna; Rustin Soraki; Shuo Liu; Taira Anderson

arxiv: 2606.18558 · v1 · pith:H7JPKOUHnew · submitted 2026-06-17 · 💻 cs.CV

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

Pith reviewed 2026-06-26 21:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D point trajectories · motion forecasting · language-conditioned prediction · goal-conditioned forecasting · robot manipulation transfer · video motion guidance · point motion benchmark

The pith

A model forecasts the future 3D trajectories of object points from a short video history and a language description of the goal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that 3D points in world coordinates offer a class-agnostic and view-stable way to represent object motion. It defines goal-conditioned 3D point motion forecasting as the task of predicting future point positions from visual history, selected query points, and a language instruction. To support the task it releases MolmoMotion-1M, a corpus of 3D point trajectories annotated from 1.16 million unconstrained videos, and PointMotionBench, a human-verified benchmark covering 111 object categories and 61 motion types. The resulting model, which can generate trajectories either autoregressively or via flow matching, outperforms prior motion predictors on the benchmark. According to the abstract, the same 3D motion prior also improves training efficiency and generalization for robot manipulation and supplies motion cues that improve realism in generated videos.

Core claim

Given a short visual history, a set of 3D query points on an object, and a language description of the intended goal, MolmoMotion predicts the future 3D trajectory of each point; the learned prior transfers to robot manipulation by improving training efficiency and generalization and supplies motion guidance that lets generative models produce videos with more realistic object motion.
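
To make the input and output shapes of the task concrete, here is a minimal sketch in Python. It is not the paper's interface: the `ForecastQuery` container and the `constant_velocity_forecast` baseline are hypothetical names, and the visual history is simplified to past 3D positions of the query points rather than video frames. The baseline ignores the instruction, so it illustrates only what a forecast looks like, not how MolmoMotion produces one.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ForecastQuery:
    """Hypothetical container for one forecasting request (illustrative names)."""
    history: List[List[Point3D]]  # history[t][k]: world-frame position of point k at past step t
    instruction: str              # language description of the intended goal


def constant_velocity_forecast(query: ForecastQuery, horizon: int) -> List[List[Point3D]]:
    """Naive baseline: extend each point along its last observed displacement.

    The instruction is ignored, so this only shows the output layout
    future[t][k]; it says nothing about a learned model's behaviour.
    """
    if len(query.history) < 2:
        raise ValueError("need at least two history steps to estimate a velocity")
    prev, last = query.history[-2], query.history[-1]
    future = []
    for step in range(1, horizon + 1):
        future.append([
            tuple(a + step * (a - b) for a, b in zip(last_pt, prev_pt))
            for last_pt, prev_pt in zip(last, prev)
        ])
    return future


query = ForecastQuery(
    history=[[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.5, 0.0, 0.0), (1.0, 0.25, 0.0)]],
    instruction="push the mug to the right",
)
future = constant_velocity_forecast(query, horizon=2)
# future[0] == [(1.0, 0.0, 0.0), (1.0, 0.5, 0.0)]
```

A learned forecaster would take the same kind of query but condition on the video frames and the instruction, so that two different instructions over the same history can yield different trajectories.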

What carries the argument

Goal-conditioned 3D point motion forecasting that maps visual history plus language goal to future point trajectories via either autoregressive coordinate prediction or flow-matching generation.
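
The flow-matching option can be illustrated with a small sketch. It assumes the common straight-line probability path between a noise sample and a data sample, which this page does not confirm is the path used in the paper, and it leaves out all conditioning on video and language. The function names are illustrative, and an oracle velocity stands in for a trained network.

```python
import random
from typing import Callable, List

Vector = List[float]  # a flattened trajectory, e.g. [x1, y1, z1, x2, y2, z2, ...]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted: Vector, x0: Vector, x1: Vector) -> float:
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
trajectory = [0.0, 0.0, 0.0, 0.5, 0.0, 0.25]        # toy "data" sample
noise = [rng.gauss(0.0, 1.0) for _ in trajectory]   # starting point x0


def oracle(x: Vector, t: float) -> Vector:
    """Stand-in for a trained network: returns the exact target velocity."""
    return target_velocity(noise, trajectory)


sample = euler_sample(oracle, noise, steps=8)  # lands on `trajectory` up to rounding
```

In training, a network would be fed `interpolate(x0, x1, t)` at a random `t` together with the conditioning inputs, and its output would be scored with `flow_matching_loss`. The autoregressive option instead emits coordinates one step at a time.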

If this is right

The model outperforms existing motion prediction baselines on PointMotionBench across 111 object categories and 61 motion types.
The learned 3D motion prior raises training efficiency and generalization when applied to robot manipulation policies.
Predicted trajectories supply motion guidance that lets video generative models synthesize sequences with more realistic object motion.
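
This page does not say which metrics PointMotionBench reports. As an assumption, the sketch below computes two measures that are common in trajectory forecasting: average displacement error (ADE) and final displacement error (FDE), both as Euclidean distances in 3D. The function name is illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Trajectory = Sequence[Sequence[Point3D]]  # trajectory[t][k]: point k at future step t


def displacement_errors(pred: Trajectory, truth: Trajectory) -> Tuple[float, float]:
    """Return (ADE, FDE): mean point error over all steps, and over the final step."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    per_step: List[float] = []
    for pred_pts, true_pts in zip(pred, truth):
        if len(pred_pts) != len(true_pts) or not pred_pts:
            raise ValueError("each step needs the same non-zero number of points")
        dists = [math.dist(p, q) for p, q in zip(pred_pts, true_pts)]
        per_step.append(sum(dists) / len(dists))
    return sum(per_step) / len(per_step), per_step[-1]


truth = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0)], [(1.0, 3.0, 4.0)]]
ade, fde = displacement_errors(pred, truth)  # ade == 2.5, fde == 5.0
```

Whatever metric the benchmark actually uses, comparing it per object category and per motion type would show whether the reported gains are uniform or concentrated in a few cases.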

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same point-based representation could support planning in multi-object scenes once interaction terms are added.
Because the output is a set of 3D trajectories rather than pixel flow, it may integrate directly with physics simulators for verification.
Scaling the dataset further could allow zero-shot transfer to novel motion types beyond the 61 covered by the current benchmark.

Load-bearing premise

3D points extracted from unconstrained videos remain a sufficiently general and view-stable representation for forecasting and for the claimed transfers to manipulation and video synthesis without extra object-specific modeling.
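
One way to probe the view-stability premise is to lift the same track to world coordinates from two camera poses and measure how far the two reconstructions disagree. The sketch below is a toy version under stated assumptions: poses are plain rotation-plus-translation pairs, rotation is restricted to the z axis, and all function names are hypothetical. It does not reproduce the paper's reconstruction pipeline.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Matrix3 = List[List[float]]


def rotation_z(angle_rad: float) -> Matrix3:
    """Rotation matrix for a turn of angle_rad about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def camera_to_world(p_cam: Point3D, rot: Matrix3, trans: Point3D) -> Point3D:
    """Apply a camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(sum(rot[i][j] * p_cam[j] for j in range(3)) + trans[i] for i in range(3))


def world_to_camera(p_world: Point3D, rot: Matrix3, trans: Point3D) -> Point3D:
    """Invert the pose: p_cam = R^T @ (p_world - t)."""
    d = [p_world[j] - trans[j] for j in range(3)]
    return tuple(sum(rot[j][i] * d[j] for j in range(3)) for i in range(3))


def view_consistency_error(track_a: Sequence[Point3D], track_b: Sequence[Point3D]) -> float:
    """Mean distance between two world-frame reconstructions of the same points."""
    return sum(math.dist(p, q) for p, q in zip(track_a, track_b)) / len(track_a)


world_track = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.2, 0.1, 1.0)]
pose_a = (rotation_z(0.0), (0.0, 0.0, 0.0))
pose_b = (rotation_z(math.pi / 3), (0.5, -0.2, 0.3))

# Observe the track from two cameras, then lift each observation back to world coordinates.
recon_a = [camera_to_world(world_to_camera(p, *pose_a), *pose_a) for p in world_track]
recon_b = [camera_to_world(world_to_camera(p, *pose_b), *pose_b) for p in world_track]
exact_error = view_consistency_error(recon_a, recon_b)

# A 2-degree error in the estimated pose of camera B breaks the agreement.
noisy_pose_b = (rotation_z(math.pi / 3 + math.radians(2.0)), pose_b[1])
recon_b_noisy = [camera_to_world(world_to_camera(p, *pose_b), *noisy_pose_b) for p in world_track]
noisy_error = view_consistency_error(recon_a, recon_b_noisy)
```

With exact poses the two reconstructions agree up to floating-point rounding. With a small pose error they drift apart, which is the kind of discrepancy an invariance check on real data would need to quantify.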

What would settle it

A controlled test in which predicted trajectories produce no measurable gain in robot task success rate or no reduction in motion artifacts in synthesized videos would show the representation does not transfer as claimed.

Original abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines language-conditioned 3D point trajectory forecasting and releases a 1.16M-video corpus plus 111-category benchmark, but the outperformance and transfer claims rest on unshown results and shaky extraction assumptions.

The main news is the task formalization around goal-conditioned 3D point trajectories in world coordinates, plus the MolmoMotion-1M dataset built from 1.16M unconstrained videos and the PointMotionBench spanning 111 categories and 61 motion types. That scale and the class-agnostic framing are new for motion forecasting work.

The model itself supports both autoregressive coordinate prediction and flow-matching generation, and the abstract states that it beats existing baselines while transferring to robot manipulation and video synthesis. The data pipeline and the decision to work directly in 3D points rather than category labels or 2D tracks are the parts that feel like genuine forward steps.

The soft spot is the 3D extraction step itself. Pulling stable world-coordinate points from in-the-wild video without calibration or object-specific modeling is known to be noisy, and the concern about view stability looks real. No error quantification or invariance checks appear in the provided material, so any systematic bias in the corpus would flow straight into the benchmark numbers and the claimed downstream gains.

This is for people building motion models for robotics or generative video who need a new large-scale 3D benchmark. It deserves peer review because the task definition and data release are substantive even if the model results require the full numbers and ablations to assess.

Referee Report

2 major / 0 minor

Summary. The paper introduces the task of goal-conditioned 3D point motion forecasting, where a model predicts future 3D trajectories of query points on an object given visual history and a language goal description. It presents MolmoMotion-1M, a corpus of action-described 3D point trajectories annotated from 1.16M unconstrained videos; PointMotionBench, a human-verified benchmark across 111 object categories and 61 motion types; and MolmoMotion, a model supporting autoregressive and flow-matching trajectory prediction. The work claims that MolmoMotion significantly outperforms motion prediction baselines on the benchmark and that the learned 3D motion prior transfers to improve robot manipulation training and to guide more realistic object motion in video synthesis.

Significance. If the quantitative claims hold with proper controls, the work would establish 3D point trajectories as a compact, class-agnostic representation for scalable motion forecasting, potentially benefiting robotics and generative video models by providing a view-stable prior learned from large-scale video data.

major comments (2)

[Dataset and benchmark sections] Dataset construction (MolmoMotion-1M): The central claim that 3D points in world coordinates extracted from unconstrained videos form a view-stable, general representation is load-bearing for both the forecasting task and the downstream transfer results, yet the manuscript provides no quantification of depth/pose estimation errors or experiments demonstrating invariance under viewpoint changes. Systematic biases in the 1.16M-video corpus would directly affect PointMotionBench scores and the reported robot/video gains.
[Experiments section] Experiments and results: The abstract asserts significant outperformance over baselines and successful transfer, but the provided text contains no tables, metrics, error bars, ablations, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed, undermining the soundness of the central claims.
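
As an illustration of the kind of uncertainty reporting the second comment asks for, the sketch below computes a paired percentile-bootstrap confidence interval on per-example error differences between two models. The numbers are made up for the example, the function name is hypothetical, and nothing here reflects the paper's actual evaluation protocol.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(errors_a: List[float], errors_b: List[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Mean of (errors_a - errors_b) with a percentile bootstrap confidence interval."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty, equally long lists of paired errors")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(diffs, k=len(diffs))  # sample pairs with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / len(diffs), low, high


# Hypothetical per-example errors (metres) for a baseline and a candidate model.
baseline = [0.30, 0.42, 0.25, 0.51, 0.38, 0.47, 0.33, 0.44]
candidate = [0.22, 0.35, 0.24, 0.40, 0.30, 0.41, 0.29, 0.36]
mean_gain, ci_low, ci_high = paired_bootstrap_ci(baseline, candidate)
```

An interval that excludes zero supports a real difference on the evaluated examples. Pairing matters because both models are scored on the same examples, so per-example difficulty cancels out of the difference.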

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our dataset construction and experimental presentation. We address each major comment below and commit to revisions that will strengthen the manuscript without altering its core contributions.

Referee: [Dataset and benchmark sections] Dataset construction (MolmoMotion-1M): The central claim that 3D points in world coordinates extracted from unconstrained videos form a view-stable, general representation is load-bearing for both the forecasting task and the downstream transfer results, yet the manuscript provides no quantification of depth/pose estimation errors or experiments demonstrating invariance under viewpoint changes. Systematic biases in the 1.16M-video corpus would directly affect PointMotionBench scores and the reported robot/video gains.

Authors: We agree this is a substantive gap. While the 3D point representation is derived from standard monocular reconstruction pipelines, the manuscript does not include explicit error quantification or viewpoint-invariance tests. In the revision we will add a dedicated analysis subsection that reports depth/pose error statistics on held-out sequences with available ground truth, plus controlled experiments that re-render the same trajectories from novel viewpoints to measure consistency. These additions will directly support the view-stability claim and allow readers to assess potential corpus biases. revision: yes
Referee: [Experiments section] Experiments and results: The abstract asserts significant outperformance over baselines and successful transfer, but the provided text contains no tables, metrics, error bars, ablations, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed, undermining the soundness of the central claims.

Authors: The full manuscript contains a complete Experiments section (Section 4) with quantitative tables, per-category metrics, error bars from repeated runs, ablation studies, and statistical significance tests. If these elements were not visible in the review copy, we will reformat and prominently place all tables, figures, and statistical details in the revised submission so that the magnitude and reliability of the reported gains are fully transparent and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: model, dataset, and benchmark are externally constructed and evaluated

The paper defines a new forecasting task, constructs MolmoMotion-1M from 1.16M external videos, introduces a human-verified PointMotionBench spanning 111 categories, and trains MolmoMotion to outperform baselines on it. No equations, parameter-fitting steps, or self-citation chains are described that would reduce any claimed prediction to an input by construction. The 3D point representation is presented as an argued modeling choice whose utility is tested empirically rather than assumed via self-reference. The derivation chain therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. All modeling choices, loss functions, and annotation assumptions remain unstated.

discussion (0)

Reference graph

Works this paper leans on

80 extracted references · 1 canonical work pages

[1]

Agarwal, A

N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[2]

Assran, A

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. Robert Hogan, D. Dugas, P. Bojanowski, V. Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y. Li, X. Ma, S. Chandar, F. Meier, Y. LeCun, M. Rabbat, and N. Ballas. V-jepa 2: Self-supervised...

2025
[3]

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[4]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025
[5]

Banerjee, S

P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, S. Han, F. Zhang, L. Zhang, J. Fountain, E. Miller, S. Basol, R. Newcombe, R. Wang, J. J. Engel, and T. Hodan. HOT3D: Hand and object tracking in 3D from egocentric multi-view videos.CVPR, 2025

2025
[6]

Bharadhwaj, D

H. Bharadhwaj, D. Dwibedi, A. Gupta, S. Tulsiani, C. Doersch, T. Xiao, D. Shah, F. Xia, D. Sadigh, and S. Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

Pith/arXiv arXiv 2024
[7]

Bharadhwaj, R

H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision (ECCV), 2024

2024
[8]

H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation, 2025. URLhttps://arxiv.org/abs/2507.23523

arXiv 2025
[9]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[10]

Bousmalis, G

K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y. Zhou, A. Gupta, A. Raju, A. Laurens, C. Fantacci, V. Dalibard, M. Zambelli, M. Martins, R. Pevceviciute, M. Blokzijl, M. Denil, N. Batchelor, T. Lampe, E. Parisotto, K. Żołna, S. Reed, S. G. Colmenarejo, J. Scholz, A. Abdolmaleki, O. Groth, J.-B. Regli, O. Sushkov, T. Rothörl,...

arXiv 2023
[11]

Bruce, M

J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. Bechtle, F. Behbahani, S. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rocktäschel. Genie: Generative interactive environments.arXiv preprint arXiv:...

arXiv 2024
[12]

Carion, L

N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. Rädle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y. Zhou, L. Momeni, R. Hazra, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollár, N. Ravi, K. Saenko...

Pith/arXiv arXiv 2025
[13]

H. Chen, B. Sun, A. Zhang, M. Pollefeys, and S. Leutenegger. Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27661–27672, 2025

2025
[14]

L.-H. Chen, J. Zhang, Y. Li, Y. Pang, X. Xia, and T. Liu. Humanmac: Masked motion completion for human motion prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 9544–9555, 2023

2023
[15]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2024. URLhttps://arxiv.org/abs/2303.04137. 12

Pith/arXiv arXiv 2024
[16]

Clark, Y

C. Clark, Y. Yang, J. S. Park, Z. Ma, J. Zhang, R. Tripathi, M. Salehi, S. Lee, T. Anderson, W. Han, et al. Molmopoint: Better pointing for vlms with grounding tokens.arXiv preprint arXiv:2603.28069, 2026

arXiv 2026
[17]

Clark, J

C. Clark, J. Zhang, Z. Ma, J. S. Park, M. Salehi, R. Tripathi, S. Lee, Z. Ren, C. D. Kim, Y. Yang, V. Shao, Y. Yang, W. Huang, Z. Gao, T. Anderson, J. Zhang, J. Jain, G. Stoica, W. Han, A. Farhadi, and R. Krishna. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611, 2026

Pith/arXiv arXiv 2026
[18]

Deshpande, M

A. Deshpande, M. Guru, R. Hendrix, S. Jauhri, A. Eftekhar, R. Tripathi, M. Argus, J. Salvador, H. Fang, M. Wallingford, W. Pumacay, Y. Kim, Q. Pfeifer, Y.-C. Lee, P. Wolters, O. Rayyan, M. Zhang, J. Duan, K. Farley, W. Han, E. VanderBilt, D. Fox, A. Farhadi, G. Chalvatzaki, D. Shah, and R. Krishna. Molmobot: Large-scale simulation enables zero-shot manipu...

arXiv 2026
[19]

Dharmarajan, W

K. Dharmarajan, W. Huang, J. Wu, L. Fei-Fei, and R. Zhang. Dream2flow: Bridging video generation and open-world manipulation with 3d object flow.arXiv preprint arXiv:2512.24766, 2025

arXiv 2025
[20]

Doersch, A

C. Doersch, A. Gupta, L. Markeeva, A. Recasens, L. Smaira, Y. Aytar, J. Carreira, A. Zisserman, and Y. Yang. Tap-vid: A benchmark for tracking any point in a video.Advances in Neural Information Processing Systems, 35: 13610–13626, 2022

2022
[21]

Doersch, Y

C. Doersch, Y. Yang, M. Vecerik, D. Gokay, A. Gupta, Y. Aytar, J. Carreira, and A. Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement.arXiv preprint arXiv:2306.08637, 2023

arXiv 2023
[22]

Doersch, P

C. Doersch, P. Luc, Y. Yang, D. Gokay, S. Koppula, A. Gupta, J. Heyward, I. Rocco, R. Goroshin, J. Carreira, and A. Zisserman. Bootstap: Bootstrapped training for tracking-any-point.arXiv preprint arXiv:2402.00847, 2024

arXiv 2024
[23]

H. Fang, J. Duan, D. Clay, S. Wang, S. Liu, W. Huang, X. Fan, W.-C. Tsai, S. Chen, Y. R. Wang, et al. Molmoact2: Action reasoning models for real-world deployment.arXiv preprint arXiv:2605.02881, 2026

Pith/arXiv arXiv 2026
[24]

H. Feng, J. Zhang, Q. Wang, Y. Ye, P. Yu, M. J. Black, T. Darrell, and A. Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world.arXiv preprint arXiv:2504.13152, 2025

arXiv 2025
[25]

Finn and S

C. Finn and S. Levine. Deep visual foresight for planning robot motion. In2017 IEEE international conference on robotics and automation (ICRA), pages 2786–2793. IEEE, 2017

2017
[26]

Garrido, T

Q. Garrido, T. Nagarajan, B. Terver, N. Ballas, Y. LeCun, and M. Rabbat. Learning latent action world models in the wild.arXiv preprint arXiv:2601.05230, 2026

arXiv 2026
[27]

Gibson.The Ecological Approach to Visual Perception

J. Gibson.The Ecological Approach to Visual Perception. Resources for ecological psychology. Lawrence Erlbaum Associates, 1986. ISBN 9780898599596. URLhttps://books.google.com/books?id=DrhCCWmJpWUC

1986
[28]

R. G. Goswami, A. Bar, D. Fan, T.-Y. Yang, G. Zhou, P. Krishnamurthy, M. Rabbat, F. Khorrami, and Y. LeCun. World models for learning dexterous hand-object interactions from human videos.arXiv preprint arXiv:2512.13644, 2025

arXiv 2025
[29]

Z. Gu, R. Yan, J. Lu, P. Li, Z. Dou, C. Si, Z. Dong, Q. Liu, C. Lin, Z. Liu, W. Wang, and Y. Liu. Diffusion as shader: 3d-aware video diffusion for versatile video generation control.arXiv preprint arXiv:2501.03847, 2025

arXiv 2025
[30]

A. W. Harley, Y. You, X. Sun, Y. Zheng, N. Raghuraman, Y. Gu, S. Liang, W.-H. Chu, A. Dave, P. Tokmakov, S. You, R. Ambrus, K. Fragkiadaki, and L. J. Guibas. Alltracker: Efficient dense point tracking at high resolution. arXiv preprint arXiv:2506.07310, 2025

arXiv 2025
[31]

Hoque, P

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

Pith/arXiv arXiv 2025
[32]

A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

Pith/arXiv arXiv 2023
[33]

Huang, Q

J. Huang, Q. Zhou, H. Rabeti, A. Korovko, H. Ling, X. Ren, T. Shen, J. Gao, D. Slepichev, C.-H. Lin, J. Ren, K. Xie, J. Biswas, L. Leal-Taixé, and S. Fidler. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

Pith/arXiv arXiv 2025
[34]

Huang, Y.-W

W. Huang, Y.-W. Chao, A. Mousavian, M.-Y. Liu, D. Fox, K. Mo, and L. Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation.arXiv preprint arXiv:2601.03782, 2026. 13

arXiv 2026
[35]

Huang, Y

Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[36]

L. Jin, R. Tucker, Z. Li, D. Fouhey, N. Snavely, and A. Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos.arXiv preprint arXiv:2412.09621, 2024

arXiv 2024
[37]

Karaev, I

N. Karaev, I. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos.arXiv preprint arXiv:2410.11831, 2024

arXiv 2024
[38]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. Kumar, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024
[39]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[40]

Y. Kim, W. Pumacay, O. Rayyan, M. Argus, W. Han, E. VanderBilt, J. Salvador, A. Deshpande, R. Hendrix, S. Jauhri, S. Liu, N. M. M. Shafiullah, M. Guru, A. Eftekhar, K. Farley, D. Clay, J. Duan, A. Guru, P. Wolters, A. Herrasti, Y.-C. Lee, G. Chalvatzaki, Y. Cui, A. Farhadi, D. Fox, and R. Krishna. Molmospaces: A large-scale open ecosystem for robot naviga...

arXiv 2026
[41]

Koppula, I

S. Koppula, I. Rocco, Y. Yang, J. Heyward, J. Carreira, A. Zisserman, G. Brostow, and C. Doersch. Tapvid-3d: A benchmark for tracking any point in 3d.Advances in Neural Information Processing Systems, 37:82149–82165, 2024

2024
[42]

C. Leys, C. Ley, O. Klein, P. Bernard, and L. Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median.Journal of Experimental Social Psychology, 49(4):764–766, 2013

2013
[43]

G. Li, Y. Lyu, Z. Liu, C. Hou, J. Zhang, and S. Zhang. H2r: A human-to-robot data augmentation for robot pre-training from videos.arXiv preprint arXiv:2505.11920, 2025

arXiv 2025
[44]

Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids.arXiv preprint arXiv:1810.01566, 2018

Pith/arXiv arXiv 2018
[45]

Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V. Ye, A. Kanazawa, A. Holynski, and N. Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10486–10496, 2025

2025
[46]

Lipman, R

Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[47]

Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, June 2022

2022
[48]

Z. Liu, S. Li, E. Cousineau, S. Feng, B. Burchfiel, and S. Song. Geometry-aware 4d video generation for robot manipulation.arXiv preprint arXiv:2507.01099, 2025

Pith/arXiv arXiv 2025
[49]

B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. InProceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’81, page 674–679, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc

1981
[50]

W. Mao, M. Liu, M. Salzmann, and H. Li. Learning trajectory dependencies for human motion prediction, 2020. URLhttps://arxiv.org/abs/1908.05436

arXiv 2020
[51]

Mendonca, S

R. Mendonca, S. Bahl, and D. Pathak. Structured world models from human videos.arXiv preprint arXiv:2308.10901, 2023

arXiv 2023
[52]

Transformersaresample-efficientworldmodels

V.Micheli, E.Alonso, andF.Fleuret. Transformersaresample-efficientworldmodels. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=vhFu1Acb0xb

2023
[53]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024. 14

2024
[54]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration
[55]

IEEE, 2024

In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024
[56]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[57]

Perrett, A

T. Perrett, A. Darkhalil, S. Sinha, O. Emara, S. Pollard, K. K. Parida, K. Liu, P. Gatti, S. Bansal, K. Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025

2025
[58]

Pont-Tuset, F

J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017

Pith/arXiv arXiv 2017
[59]

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

Pith/arXiv arXiv 2024
[60]

Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations

Ropedia AI. Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations. https://huggingface.co/datasets/ropedia-ai/xperience-10m, 2026. Hugging Face dataset

2026
[61]

Sanchez-Gonzalez, J

A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia. Learning to simulate complex physics with graph networks. InInternational conference on machine learning, pages 8459–8468. PMLR, 2020

2020
[62]

Soraki, H

R. Soraki, H. Bharadhwaj, A. Farhadi, and R. Mottaghi. Objectforesight: Predicting future 3d object trajectories from human videos.arXiv preprint arXiv:2601.05237, 2026

arXiv 2026
[63]

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URLhttps://arxiv.org/abs/2104.09864

Pith/arXiv arXiv 2023
[64]

Teed and J

Z. Teed and J. Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean conference on computer vision, pages 402–419. Springer, 2020

2020
[65]

Thakkar, S

N. Thakkar, S. Ginosar, J. Walker, J. Malik, J. Carreira, and C. Doersch. Forecasting motion in the wild.arXiv preprint arXiv:2604.01015, 2026

arXiv 2026
[66]

V. A. Traag, L. Waltman, and N. J. van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1):5233, 2019

2019
[67]

Ullman.The Interpretation of Visual Motion

S. Ullman.The Interpretation of Visual Motion. The MIT Press, 03 1979. ISBN 9780262257121. doi: 10.7551/ mitpress/3877.001.0001. URLhttps://doi.org/10.7551/mitpress/3877.001.0001

work page doi:10.7551/mitpress/3877.001.0001 1979
[68]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[69]

B. Wen, J. Tremblay, V. Blukis, S. Tyree, T. Müller, A. Evans, D. Fox, J. Kautz, and S. Birchfield. Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 606–617, 2023

2023
[70]

Spatialtrackerv2: 3d point tracking made easy

Y.Xiao, J.Wang, N.Xue, N.Karaev, Y.Makarov, B.Kang, X.Zhu, H.Bao, Y.Shen, andX.Zhou. Spatialtrackerv2: 3d point tracking made easy. InProceedings of the IEEE/CVF International Conference on Computer Vision,
[71]

URLhttps://arxiv.org/abs/2507.12462

arXiv
[72]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[73]

L. Yang, Y. Fan, and N. Xu. Video instance segmentation. InProceedings of the IEEE/CVF international conference on computer vision, pages 5188–5197, 2019

2019
[74]

R. Yang, Q. Yu, Y. Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y. Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

Pith/arXiv arXiv 2025
[75]

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024
[76]

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y.-W. Chao, B. Y. Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos. InInternational Conference on Learning Representations (ICLR), 2025. 15

[77]

T. Yoshida, S. Kurita, T. Nishimura, and S. Mori. Generating 6dof object manipulation trajectories from action description in egocentric vision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17370–17382, 2025

[78]

C. Zhang, G. L. Moing, S. Koppula, I. Rocco, L. Momeni, J. Xie, S. Sun, R. Sukthankar, J. K. Barral, R. Hadsell, Z. Ghahramani, A. Zisserman, J. Zhang, and M. S. M. Sajjadi. Efficiently reconstructing dynamic scenes one d4rt at a time, 2025. URL https://arxiv.org/abs/2512.08924

[79]

G. Zhou, H. Pan, Y. LeCun, and L. Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, p...

[80]

H. Zhou, J. Cao, L. Ma, X. Fang, and G.-J. Qi. Traj2action: A co-denoising framework for trajectory-guided human-to-robot skill transfer, 2026. URL https://arxiv.org/abs/2510.00491

Appendix

A Qualitative examples
B MolmoMotion-1M Data Generation Details

