MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction
Pith reviewed 2026-06-26 21:26 UTC · model grok-4.3
The pith
A model forecasts future 3D trajectories of object points given short video history and language goal descriptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a short visual history, a set of 3D query points on an object, and a language description of the intended goal, MolmoMotion predicts the future 3D trajectory of each point. The learned 3D motion prior transfers to robot manipulation, where it improves training efficiency and generalization, and its predicted trajectories supply motion guidance that lets video generative models produce more realistic object motion.
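To make the inputs and outputs of the task concrete, here is a minimal Python sketch of the forecasting interface. The names `ForecastQuery` and `static_forecast` are hypothetical and do not come from the paper. The "model" is a trivial stand-in that keeps every point where it is and ignores both the video frames and the language goal, so it only illustrates the shapes involved.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ForecastQuery:
    """Hypothetical container for one forecasting example (not the paper's API)."""
    history_frames: Sequence[object]  # short visual history, left opaque here
    query_points: List[Point3D]       # 3D query points on the object, world coordinates
    goal: str                         # language description of the intended goal


def static_forecast(query: ForecastQuery, horizon: int) -> List[List[Point3D]]:
    """Trivial stand-in for a model: every query point stays at its position.

    Returns one trajectory per query point, each holding `horizon` future positions.
    """
    return [[p for _ in range(horizon)] for p in query.query_points]


query = ForecastQuery(
    history_frames=[],
    query_points=[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)],
    goal="push the mug to the left",
)
trajectories = static_forecast(query, horizon=4)
print(len(trajectories), len(trajectories[0]))
```

A real forecaster would replace `static_forecast` with a learned function of all three inputs; the output layout (points, then time steps, then xyz) is one plausible choice among several.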
What carries the argument
Goal-conditioned 3D point motion forecasting that maps visual history plus language goal to future point trajectories via either autoregressive coordinate prediction or flow-matching generation.
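For readers unfamiliar with flow-matching generation, the sketch below shows the sampling side in its simplest form: start from a noise vector and integrate a velocity field from t = 0 to t = 1 with Euler steps. The velocity field here is a hand-written toy that already knows the target trajectory; in a real system it would be a learned network conditioned on the video history and language goal. The function names and the flattened trajectory layout are assumptions made for illustration, not details from the paper.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Velocity of the straight-line path x_t = (1 - t) * x0 + t * target.

    Toy stand-in for a learned network: a real model would not know `target`.
    """
    def velocity(x: Vector, t: float) -> Vector:
        # Along the straight path, (target - x_t) / (1 - t) equals target - x0.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Flattened toy trajectory: two future steps of one 3D point (x, y, z per step).
target_trajectory = [0.0, 0.0, 1.0, 0.1, 0.0, 1.0]
noise = [0.5, -0.3, 0.2, -0.1, 0.4, 0.9]  # fixed "noise" sample for reproducibility
sample = euler_sample(make_toy_velocity(target_trajectory), noise, num_steps=10)
print(sample)
```

Because the toy field follows a straight line, Euler integration lands on the target up to floating-point error; a learned field would only approximate this, and the number of steps becomes a quality/speed trade-off.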
If this is right
- The model outperforms existing motion prediction baselines on PointMotionBench across 111 object categories and 61 motion types.
- The learned 3D motion prior raises training efficiency and generalization when applied to robot manipulation policies.
- Predicted trajectories supply motion guidance that lets video generative models synthesize sequences with more realistic object motion.
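Claims of outperformance on trajectory forecasting are usually stated in terms of displacement errors. The sketch below computes two common ones, average displacement error (ADE) and final displacement error (FDE), for sets of 3D point trajectories. Whether PointMotionBench uses these exact metrics is not stated in the text reviewed here, so treat them as an assumed, generic choice; the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Trajectory = Sequence[Point3D]


def displacement_errors(pred: Sequence[Trajectory],
                        truth: Sequence[Trajectory]) -> Tuple[float, float]:
    """Return (ADE, FDE) over a set of point trajectories.

    ADE: mean Euclidean error over all points and time steps.
    FDE: mean Euclidean error at the final time step of each trajectory.
    """
    all_errors: List[float] = []
    final_errors: List[float] = []
    for p_traj, t_traj in zip(pred, truth):
        errors = [math.dist(p, t) for p, t in zip(p_traj, t_traj)]
        all_errors.extend(errors)
        final_errors.append(errors[-1])
    return sum(all_errors) / len(all_errors), sum(final_errors) / len(final_errors)


# One point, two future steps; the prediction is off by 2 units at the last step.
truth = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]]
ade, fde = displacement_errors(pred, truth)
print(ade, fde)
```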
Where Pith is reading between the lines
- The same point-based representation could support planning in multi-object scenes once interaction terms are added.
- Because the output is a set of 3D trajectories rather than pixel flow, it may integrate directly with physics simulators for verification.
- Scaling the dataset further could allow zero-shot transfer to novel motion types beyond the 61 covered by the current benchmark.
Load-bearing premise
3D points extracted from unconstrained videos remain a sufficiently general and view-stable representation for forecasting and for the claimed transfers to manipulation and video synthesis without extra object-specific modeling.
What would settle it
A controlled test in which predicted trajectories produce no measurable gain in robot task success rate or no reduction in motion artifacts in synthesized videos would show the representation does not transfer as claimed.
Original abstract
Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the task of goal-conditioned 3D point motion forecasting, where a model predicts future 3D trajectories of query points on an object given visual history and a language goal description. It presents MolmoMotion-1M, a corpus of action-described 3D point trajectories annotated from 1.16M unconstrained videos; PointMotionBench, a human-verified benchmark across 111 object categories and 61 motion types; and MolmoMotion, a model supporting autoregressive coordinate prediction and flow-matching trajectory generation. The work claims that MolmoMotion significantly outperforms motion prediction baselines on the benchmark and that the learned 3D motion prior transfers to improve robot manipulation training and to guide more realistic object motion in video synthesis.
Significance. If the quantitative claims hold with proper controls, the work would establish 3D point trajectories as a compact, class-agnostic representation for scalable motion forecasting, potentially benefiting robotics and generative video models by providing a view-stable prior learned from large-scale video data.
Major comments (2)
- [Dataset and benchmark sections] Dataset construction (MolmoMotion-1M): The central claim that 3D points in world coordinates extracted from unconstrained videos form a view-stable, general representation is load-bearing for both the forecasting task and the downstream transfer results, yet the manuscript provides no quantification of depth/pose estimation errors or experiments demonstrating invariance under viewpoint changes. Systematic biases in the 1.16M-video corpus would directly affect PointMotionBench scores and the reported robot/video gains.
- [Experiments section] Experiments and results: The abstract asserts significant outperformance over baselines and successful transfer, but the provided text contains no tables, metrics, error bars, ablations, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed, undermining the soundness of the central claims.
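The first comment's concern about depth error and view stability can be illustrated with a small geometric check: express a world point in two camera frames, lift it back to world coordinates, and measure how far the two reconstructions disagree. The sketch below is a toy under stated assumptions (yaw-only cameras, a purely multiplicative depth error that scales the whole back-projected ray); the function names are hypothetical and this is not the paper's reconstruction pipeline.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Camera = Tuple[float, Vec3]  # (yaw about the y-axis in radians, camera center in world)


def rotate_y(p: Vec3, angle: float) -> Vec3:
    """Rotate a point about the y-axis."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x + s * z, y, -s * x + c * z)


def world_to_camera(p: Vec3, cam: Camera) -> Vec3:
    yaw, center = cam
    shifted = (p[0] - center[0], p[1] - center[1], p[2] - center[2])
    return rotate_y(shifted, -yaw)


def camera_to_world(p: Vec3, cam: Camera, depth_scale: float = 1.0) -> Vec3:
    """Lift a camera-frame point to world coordinates.

    depth_scale != 1 mimics a multiplicative monocular depth error (assumed model).
    """
    yaw, center = cam
    scaled = (p[0] * depth_scale, p[1] * depth_scale, p[2] * depth_scale)
    r = rotate_y(scaled, yaw)
    return (r[0] + center[0], r[1] + center[1], r[2] + center[2])


def view_gap(p_world: Vec3, cam_a: Camera, cam_b: Camera, depth_scale: float) -> float:
    """Distance between the world points reconstructed from two views."""
    rec_a = camera_to_world(world_to_camera(p_world, cam_a), cam_a, depth_scale)
    rec_b = camera_to_world(world_to_camera(p_world, cam_b), cam_b, depth_scale)
    return math.dist(rec_a, rec_b)


point = (0.3, 0.2, 2.0)
cam_a: Camera = (0.0, (0.0, 0.0, 0.0))
cam_b: Camera = (0.5, (1.0, 0.0, 0.0))
gap_clean = view_gap(point, cam_a, cam_b, depth_scale=1.0)
gap_noisy = view_gap(point, cam_a, cam_b, depth_scale=1.1)
print(gap_clean, gap_noisy)
```

Under this error model the disagreement equals |1 - depth_scale| times the distance between the camera centers, so a 10% depth error with cameras one unit apart yields a gap of about 0.1. That is the kind of quantity a view-stability analysis would need to report on real data.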
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of our dataset construction and experimental presentation. We address each major comment below and commit to revisions that will strengthen the manuscript without altering its core contributions.
Point-by-point responses
- Referee: [Dataset and benchmark sections] Dataset construction (MolmoMotion-1M): The central claim that 3D points in world coordinates extracted from unconstrained videos form a view-stable, general representation is load-bearing for both the forecasting task and the downstream transfer results, yet the manuscript provides no quantification of depth/pose estimation errors or experiments demonstrating invariance under viewpoint changes. Systematic biases in the 1.16M-video corpus would directly affect PointMotionBench scores and the reported robot/video gains.
  Authors: We agree this is a substantive gap. While the 3D point representation is derived from standard monocular reconstruction pipelines, the manuscript does not include explicit error quantification or viewpoint-invariance tests. In the revision we will add a dedicated analysis subsection that reports depth/pose error statistics on held-out sequences with available ground truth, plus controlled experiments that re-render the same trajectories from novel viewpoints to measure consistency. These additions will directly support the view-stability claim and allow readers to assess potential corpus biases.
  Revision: yes
- Referee: [Experiments section] Experiments and results: The abstract asserts significant outperformance over baselines and successful transfer, but the provided text contains no tables, metrics, error bars, ablations, or statistical tests. Without these, the magnitude and reliability of the claimed improvements cannot be assessed, undermining the soundness of the central claims.
  Authors: The full manuscript contains a complete Experiments section (Section 4) with quantitative tables, per-category metrics, error bars from repeated runs, ablation studies, and statistical significance tests. If these elements were not visible in the review copy, we will reformat and prominently place all tables, figures, and statistical details in the revised submission so that the magnitude and reliability of the reported gains are fully transparent and reproducible.
  Revision: yes
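As an illustration of the kind of statistical check the second exchange refers to, the sketch below runs a paired bootstrap on per-example errors from a model and a baseline and reports a percentile confidence interval for the mean difference. The error values are made up for the example, the function name is hypothetical, and nothing here reflects the tests actually used in the manuscript.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(model_err: Sequence[float], baseline_err: Sequence[float],
                        num_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Mean paired difference (model - baseline) with a percentile bootstrap CI."""
    if len(model_err) != len(baseline_err):
        raise ValueError("error lists must be paired and of equal length")
    diffs = [m - b for m, b in zip(model_err, baseline_err)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    means: List[float] = []
    for _ in range(num_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    low = means[int((alpha / 2) * num_resamples)]
    high = means[int((1 - alpha / 2) * num_resamples) - 1]
    return sum(diffs) / n, low, high


# Made-up per-example errors on the same six test examples, for illustration only.
model_err = [0.10, 0.12, 0.09, 0.15, 0.11, 0.08]
baseline_err = [0.14, 0.15, 0.13, 0.16, 0.15, 0.12]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(model_err, baseline_err)
print(mean_diff, ci_low, ci_high)
```

An interval that lies entirely below zero indicates the model's error is lower than the baseline's on this sample. Pairing by example matters because it removes variance from example difficulty that an unpaired comparison would leave in.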
Circularity Check
No circularity: model, dataset, and benchmark are externally constructed and evaluated
Full rationale
The paper defines a new forecasting task, constructs MolmoMotion-1M from 1.16M external videos, introduces a human-verified PointMotionBench spanning 111 object categories and 61 motion types, and trains MolmoMotion to outperform baselines on it. No equations, parameter-fitting steps, or self-citation chains are described that would reduce any claimed prediction to an input by construction. The 3D point representation is presented as an argued modeling choice whose utility is tested empirically rather than assumed via self-reference. The derivation chain therefore remains independent of its own outputs.