Recognition: unknown
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3
The pith
A 64x compressed motion embedding learned from tracker trajectories enables efficient generation of realistic long-term motions conditioned on text or pokes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. We first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
What carries the argument
the long-term motion embedding with 64x temporal compression learned from tracker trajectories, which carries the argument by allowing direct operation on compressed latents for conditional flow-matching generation instead of full video synthesis
Load-bearing premise
That a motion embedding learned solely from tracker trajectories with 64x temporal compression retains sufficient information to generate realistic long-term motions conditioned on text or pokes.
What would settle it
If side-by-side evaluations on held-out long video sequences show that motions generated from the embedding are less realistic or less consistent than outputs from full video synthesis models according to standard metrics or human raters, the central efficiency and quality claim would not hold.
Figures
read the original abstract
Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to learn a highly compressed (64x temporal) long-term motion embedding directly from large-scale tracker trajectories, then train a conditional flow-matching model in this latent space to generate realistic motions conditioned on text prompts or spatial pokes. The central assertion is that the resulting motion distributions outperform both state-of-the-art video models and specialized task-specific baselines.
Significance. If the empirical claims hold, the approach would offer a substantial efficiency gain for long-horizon motion generation by avoiding full video synthesis, with relevance to robotics, animation, and predictive visual intelligence. The combination of tracker-derived embeddings and conditional flow matching is a plausible direction for scalable kinematics modeling.
major comments (3)
- [Abstract] Abstract: the headline claim that the generated motion distributions 'outperform those of both state-of-the-art video models and specialized task-specific approaches' is presented without any quantitative metrics, ablation results, or evaluation protocol, which is load-bearing for the central contribution.
- [Embedding learning] Embedding learning section: the 64x temporal compression of tracker trajectories is asserted to retain sufficient information for realistic conditioned generation, yet no reconstruction fidelity analysis, information-retention study, or comparison of high-frequency dynamics (contact events, velocity changes) is provided; this directly affects whether the latent space can support the claimed superiority.
- [Experiments] Experiments / evaluation: no details are given on the flow-matching training objective, conditioning mechanisms for text/pokes, datasets, baselines, or statistical significance of the outperformance claims, preventing verification of the method's soundness.
minor comments (1)
- [Abstract] Abstract: the phrase 'large-scale trajectories obtained from tracker models' should specify the particular trackers and source datasets used.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, clarifying aspects of the work and indicating where revisions will be made to improve clarity and completeness.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim that the generated motion distributions 'outperform those of both state-of-the-art video models and specialized task-specific approaches' is presented without any quantitative metrics, ablation results, or evaluation protocol, which is load-bearing for the central contribution.
Authors: We agree that the abstract should more explicitly support its central claim. The full manuscript includes quantitative results in the Experiments section, with comparisons against video models and task-specific baselines using metrics such as motion FID, diversity, and human preference rates, along with details on the evaluation protocol and datasets. We will revise the abstract to briefly reference these supporting elements (e.g., 'outperform ... as shown by quantitative metrics and user studies on standard benchmarks'). revision: yes
-
Referee: [Embedding learning] Embedding learning section: the 64x temporal compression of tracker trajectories is asserted to retain sufficient information for realistic conditioned generation, yet no reconstruction fidelity analysis, information-retention study, or comparison of high-frequency dynamics (contact events, velocity changes) is provided; this directly affects whether the latent space can support the claimed superiority.
Authors: The Embedding Learning section reports reconstruction results from the compressed latents, including overall trajectory fidelity metrics. However, we acknowledge that dedicated analysis of high-frequency elements such as contact events and velocity profiles is not separately highlighted. We will add this in the revision, including quantitative comparisons of reconstructed vs. original high-frequency dynamics and visualizations to better demonstrate information retention. revision: partial
-
Referee: [Experiments] Experiments / evaluation: no details are given on the flow-matching training objective, conditioning mechanisms for text/pokes, datasets, baselines, or statistical significance of the outperformance claims, preventing verification of the method's soundness.
Authors: The Experiments section details the conditional flow-matching objective (standard velocity-field parameterization), conditioning mechanisms (cross-attention for text prompts via CLIP embeddings and direct feature concatenation for spatial pokes), the large-scale tracker-derived datasets, baselines (including recent video diffusion models and specialized motion generators), and statistical reporting (means and standard deviations over multiple seeds). If these elements were insufficiently prominent, we will expand the descriptions, add pseudocode, and clarify the evaluation protocol in the revised manuscript. revision: yes
Circularity Check
Empirical training pipeline with no derivation chain or self-referential reductions
full rationale
The paper describes a two-stage empirical process: first training a motion embedding from large-scale tracker trajectories at 64x temporal compression, then training a conditional flow-matching model on the resulting latents to generate motions from text or poke conditions. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or prior self-citations. The central claim of outperforming video models and task-specific baselines rests on experimental comparisons rather than any algebraic identity or load-bearing self-reference. This is a standard self-contained ML pipeline whose validity is evaluated externally against benchmarks, yielding no circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Tracker models produce sufficiently accurate and diverse large-scale trajectories for learning a general motion embedding.
- domain assumption Flow-matching models can generate high-quality distributions in the compressed latent space.
invented entities (1)
-
long-term motion embedding
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Sora 2, 2025
Open AI. Sora 2, 2025. 2
2025
-
[2]
Kelsey R Allen, Carl Doersch, Guangyao Zhou, Mohammed Suhail, Danny Driess, Ignacio Rocco, Yulia Rubanova, Thomas Kipf, Mehdi S. M. Sajjadi, Kevin Patrick Murphy, Joao Carreira, and Sjoerd van Steenkiste. Direct motion models for assessing generated videos. InForty-second Inter- national Conference on Machine Learning, 2025. 2
2025
-
[3]
Back to the features: Dino as a foundation for video world models.arXiv preprint arXiv:2507.19468,
Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, and Piotr Bojanowski. Back to the fea- tures: Dino as a foundation for video world models.arXiv preprint arXiv:2507.19468, 2025. 2
-
[4]
Round and round we go! what makes rotary positional encodings useful? InThe Thirteenth International Conference on Learning Representa- tions, 2025
Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, and Petar Veliˇckovi´c. Round and round we go! what makes rotary positional encodings useful? InThe Thirteenth International Conference on Learning Representa- tions, 2025. 4
2025
-
[5]
Kalayeh, and Bj¨orn Ommer
Stefan Andreas Baumann, Jannik Wiese, Tommaso Mar- torella, Mahdi M. Kalayeh, and Bj¨orn Ommer. Envisioning the future, one step at a time, 2026. 6
2026
-
[6]
Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation
Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024. 3, 6
2024
-
[7]
ipoke: Poking a still image for controlled stochastic video synthesis
Andreas Blattmann, Timo Milbich, Michael Dorkenwald, and Bj¨orn Ommer. ipoke: Poking a still image for controlled stochastic video synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14707– 14717, 2021. 3, 4
2021
-
[8]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets.CoRR, abs/2311.15127,
work page internal anchor Pith review arXiv
-
[9]
Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. What happens next? antici- pating future motion by generating point trajectories.arXiv preprint arXiv:2509.21592, 2025. 2, 6, 12
-
[10]
VideoJAM: Joint appearance-motion representations for en- hanced motion generation in video models
Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for en- hanced motion generation in video models. InForty-second International Conference on Machine Learning, 2025. 3
2025
-
[11]
Jeremy A Collins, Lor ´and Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, and Animesh Garg. Amplify: Ac- tionless motion priors for robot learning from videos.arXiv preprint arXiv:2506.14198, 2025. 2, 3, 7, 13
-
[12]
Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers
Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Ship- pole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. InForty-first Interna- tional Conference on Machine Learning, 2024. 4, 13, 14
2024
-
[13]
Veo, 2025
Google DeepMind. Veo, 2025. 2, 8
2025
-
[14]
Bert: Pre-training of deep bidirectional transform- ers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transform- ers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019. 5
2019
-
[15]
Motion prompt- ing: Controlling video generation with motion trajectories
Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, et al. Motion prompt- ing: Controlling video generation with motion trajectories. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1–12, 2025. 3
2025
-
[16]
The” something something” video database for learning and evaluating visual common sense
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The” something something” video database for learning and evaluating visual common sense. InProceed- ings of the IEEE international conference on computer vision, pages 584...
2017
-
[17]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 2
work page internal anchor Pith review arXiv 2024
-
[18]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 4
2022
-
[19]
beta-V AE: Learning basic visual con- cepts with a constrained variational framework
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-V AE: Learning basic visual con- cepts with a constrained variational framework. InInterna- tional Conference on Learning Representations, 2017. 4
2017
-
[20]
MiniCPM: Unveiling the potential of small language models with scalable training strategies
Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, dahai li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of small language models with...
2024
-
[21]
Co- tracker3: Simpler and better point tracking by pseudo- labelling real videos
Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker3: Simpler and better point tracking by pseudo- labelling real videos. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 6013–6022,
-
[22]
arXiv preprint arXiv:2412.11673 (2024) 4
Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Dino-foresight: Looking into the future with dino.arXiv preprint arXiv:2412.11673, 2024. 2
-
[23]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 3
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[24]
Molmoact: Action reasoning models that can reason in space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Boyang Li, Shuo Liu, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hen- drix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. Molmoact: Action reasoning models that can reason in space. InWorkshop on Making Sense of Data in Robot...
2025
-
[25]
Movideo: Motion-aware video generation with diffusion model
Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, and Rakesh Ranjan. Movideo: Motion-aware video generation with diffusion model. InEuropean Conference on Computer Vision, pages 56–74. Springer, 2024. 3
2024
-
[26]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for genera- tive modeling. InThe Eleventh International Conference on Learning Representations, 2023. 5
2023
-
[27]
Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 6, 7, 12
2023
-
[28]
Decoupled weight de- cay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learning Representations, 2019. 5, 13, 14
2019
-
[29]
InProceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640
Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025. 7, 15
-
[30]
Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model
Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. InEuropean Conference on Computer Vision, pages 111–128. Springer, 2024. 2, 3
2024
-
[31]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth´ee Darcet, Th´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 2, 13, 14
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
The 2017 DAVIS Challenge on Video Object Segmentation
Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar- bel´aez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017. 7, 15
work page internal anchor Pith review arXiv 2017
-
[33]
GLU Variants Improve Transformer
Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020. 14
work page internal anchor Pith review arXiv 2002
-
[34]
Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling
Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 2, 3, 6, 15
2024
-
[35]
Dragdiffusion: Harnessing diffusion models for interactive point-based image editing
Yujun Shi, Chuhui Xue, Jun Hao Liew, Jiachun Pan, Han- shu Yan, Wenqing Zhang, Vincent YF Tan, and Song Bai. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8839–8849, 2024. 3
2024
-
[36]
Instant- drag: Improving interactivity in drag-based image editing
Joonghyuk Shin, Daehyeon Choi, and Jaesik Park. Instant- drag: Improving interactivity in drag-based image editing. In SIGGRAPH Asia 2024 Conference Papers, pages 1–10, 2024. 3
2024
-
[37]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,
-
[38]
Fourier features let networks learn high frequency functions in low dimensional domains.Advances in neural information processing systems, 33:7537–7547, 2020
Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ra- mamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains.Advances in neural information processing systems, 33:7537–7547, 2020. 4
2020
-
[39]
Forecasting motion in the wild.arXiv preprint arXiv:2604.01015, 2026
Neerja Thakkar, Shiry Ginosar, Jacob C Walker, Jitendra Malik, Jo˜ao Carreira, and Carl Doersch. Forecasting motion in the wild.arXiv preprint arXiv:2604.01015, 2026. 3
-
[40]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timoth ´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Openvideo: Pexels-raw (720p) video dataset
UmiMarch. Openvideo: Pexels-raw (720p) video dataset. https : / / github . com / UmiMarch / OpenVideo,
-
[42]
6, 8, 12
Accessed: 2025-11-10. 6, 8, 12
2025
-
[43]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 4
2017
-
[44]
An uncertain future: Forecasting from static images using variational autoencoders
Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. InEuropean conference on computer vision, pages 835–851. Springer, 2016. 2
2016
-
[45]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 2, 8, 12
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Koala-36m: A large-scale video dataset improv- ing consistency between fine-grained conditions and video content
Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jia- hao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improv- ing consistency between fine-grained conditions and video content. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8428–8437, 2025. 5
2025
-
[47]
Any-point trajectory modeling for policy learning,
Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023. 3, 5, 7, 8, 12, 13
-
[48]
Draganything: Motion control for any- thing using entity representation
Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for any- thing using entity representation. InEuropean Conference on Computer Vision, pages 331–348. Springer, 2024. 3
2024
-
[49]
Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy condition- ing
Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy condition- ing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6960–6970, 2025. 3, 5, 7, 8, 12, 13
2025
-
[50]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 2
work page internal anchor Pith review arXiv 2024
-
[51]
Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019
Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019. 14
2019
-
[52]
arXiv preprint arXiv:2507.04447 (2025) 3, 7, 14
Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xin- qiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint arXiv:2507.04447, 2025. 3
-
[53]
TraceVLA: Visual trace prompting enhances spatial- temporal awareness for generalist robotic policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. TraceVLA: Visual trace prompting enhances spatial- temporal awareness for generalist robotic policies. InThe Thirteenth International Conference on Learning Representa- tions, 2025. 3
2025
- [54]
-
[55]
Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning.arXiv preprint arXiv:2411.04983, 2024. 2, 3 Learning Long-term Motion Embeddings for Efficient Kinematics Generation Supplementary Material A. Additional Evaluation Details LIBERO Trajectory PredictionWe compare the accu- ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.