The TIME Machine: On The Power of Motion for Efficient Perception
Pith reviewed 2026-05-25 05:38 UTC · model grok-4.3
The pith
Motion point tracks trained via masked autoencoding match state-of-the-art video models with up to 10,000 times less data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training a masked autoencoder to reconstruct masked point tracks from synthetic motion data produces a Temporally Informed Motion Embedding (TIME) that, when used in a zero-shot manner, performs on par with state-of-the-art video models on standard tasks despite using up to four orders of magnitude less training data.
What carries the argument
Masked autoencoder on sequences of point tracks that learns to predict missing motion trajectories, serving as the self-supervised objective for motion-based video representations.
Load-bearing premise
Point-track motion data by itself is enough to learn useful representations for typical video tasks.
What would settle it
Observing that the TIME embedding underperforms significantly compared to appearance or language-based models on multiple video benchmarks when both are evaluated zero-shot.
Figures
read the original abstract
Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware as well as more scalable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TIME (Temporally Informed Motion Embedding), a self-supervised video representation obtained by training a masked autoencoder to reconstruct masked point tracks from synthetic motion data. It claims this motion-only approach bypasses language supervision and appearance dependence, yielding zero-shot performance on par with state-of-the-art video models across a wide range of tasks while using up to four orders of magnitude less training data, and improving temporal understanding.
Significance. If the zero-shot parity claim holds with rigorous evidence, the result would be significant: it would demonstrate that purely geometric motion representations can match or exceed appearance- and language-supervised models on standard video benchmarks at dramatically lower data and compute cost, opening a path to more scalable and temporally precise video models.
major comments (2)
- [Abstract] Abstract: the central empirical claim of 'performance on par with state-of-the-art models using up to 4 orders of magnitude less training data' is stated without any metrics, baselines, task list, evaluation protocol, or quantitative comparison, so the claim cannot be assessed from the provided text.
- [Abstract] The manuscript provides no evidence that point-track motion alone encodes the fine-grained semantics required for transfer to standard video-understanding tasks that are appearance-dependent; the generalization step from synthetic tracks to real-video benchmarks therefore remains unsecured.
Simulated Author's Rebuttal
We thank the referee for the comments on our manuscript. We address each major comment below, clarifying the structure of the paper and the evidence provided in the experimental sections.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central empirical claim of 'performance on par with state-of-the-art models using up to 4 orders of magnitude less training data' is stated without any metrics, baselines, task list, evaluation protocol, or quantitative comparison, so the claim cannot be assessed from the provided text.
Authors: The abstract is designed to be a concise high-level summary of the key contribution and findings. The specific metrics, baselines (including state-of-the-art video models), task list (covering action recognition, temporal action localization, video question answering, and others), evaluation protocols, and quantitative comparisons are fully detailed in the Experiments section of the manuscript, where zero-shot results are reported against models trained on orders of magnitude more data. revision: no
-
Referee: [Abstract] The manuscript provides no evidence that point-track motion alone encodes the fine-grained semantics required for transfer to standard video-understanding tasks that are appearance-dependent; the generalization step from synthetic tracks to real-video benchmarks therefore remains unsecured.
Authors: The manuscript provides extensive empirical evidence through zero-shot transfer experiments on real-world, appearance-dependent benchmarks. These include evaluations on datasets requiring fine-grained semantic understanding (e.g., action recognition on Kinetics and Something-Something, video QA on ActivityNet-QA), where the motion-only TIME embedding achieves performance on par with appearance- and language-supervised models despite training exclusively on synthetic point tracks. The results, including ablations on motion vs. appearance, are presented in Sections 4 and 5, demonstrating the generalization from synthetic motion data to real videos. revision: no
Circularity Check
No circularity: purely empirical claims with no derivation chain
full rationale
The paper describes a masked autoencoder trained on synthetic point tracks to produce TIME embeddings, then reports zero-shot transfer performance on video tasks. No equations, fitted parameters presented as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The central claim is an observational performance statement (on-par results with 4 orders less data) that does not reduce to its inputs by construction. This is the expected non-finding for an empirical methods paper without mathematical derivations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Motion point-tracks contain sufficient information for video representation learning independent of appearance.
invented entities (1)
-
TIME embedding
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks... trained exclusively on synthetic point trajectories from rigid-body physics simulations from Kubric
-
IndisputableMonolith/Foundation/ArrowOfTime.leanarrow_from_z unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Figure 1: ... Arrow of Time task
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
YouTube-8M: A Large-Scale Video Classification Benchmark
Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark, 2016. URLhttps://arxiv.org/abs/1609.08675
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[2]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mahmoud Assran et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1728–1738, October 2021
work page 2021
-
[4]
Is space-time attention all you need for video understanding? InICML, 2021
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InICML, 2021
work page 2021
-
[5]
Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. What happens next? anticipating future motion by generating point trajectories.arXiv preprint arXiv:2509.21592, 2025
-
[6]
Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025
Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat, and Em- manuel Dupoux. Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025. URLhttps://arxiv.org/abs/2506.09849
-
[7]
Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models, 2024
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models, 2024. URLhttps://arxiv.org/abs/2410.10818
-
[8]
Emerging properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the International Conference on Computer Vision (ICCV), 2021
work page 2021
-
[9]
Quo vadis, action recognition? a new model and the kinetics dataset
Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308, 2017
work page 2017
-
[10]
A short note on the kinetics- 700 human action dataset, 2022
Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics- 700 human action dataset, 2022. URLhttps://arxiv.org/abs/1907.06987
-
[11]
Wei-Yi Chen, Yi-Ling Lin, Yu-An Su, Wei-Hsin Yeh, and Lun-Wei Ku. Yourskatingcoach: A figure skating video benchmark for fine-grained element analysis.ArXiv, abs/2410.20427, 2024. URLhttps://api.semanticscholar.org/CorpusID:273654721
-
[12]
It’s a matter of time: Three lessons on long-term motion for perception, 2026
Willem Davison, Xinyue Hao, and Laura Sevilla-Lara. It’s a matter of time: Three lessons on long-term motion for perception, 2026. URLhttps://arxiv.org/abs/2602.14705
-
[13]
Imagenet: A large- scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009
work page 2009
-
[14]
Flownet: Learning optical flow with convolutional networks
Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. InProceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015. 10
work page 2015
-
[15]
Who’s better? who’s best? pairwise deep ranking for skill determination
Hazel Doughty, Dima Damen, and Walterio Mayol-Cuevas. Who’s better? who’s best? pairwise deep ranking for skill determination. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6057–6066, 2018
work page 2018
-
[16]
Action modifiers: Learning from adverbs in instructional videos
Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. Action modifiers: Learning from adverbs in instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 868–878, 2020
work page 2020
-
[17]
Ovr: A dataset for open vocabulary temporal repetition counting in videos, 2024
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, and Andrew Zisserman. Ovr: A dataset for open vocabulary temporal repetition counting in videos, 2024. URL https://arxiv.org/ abs/2407.17085
-
[18]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis
Kunchang Fu, Zhenjiang Dai, Jianwei Guo, Yinan He, Yuqi Zuo, Chao Chen, Ziyue Yu, Yuxia Li, Zhe Chen, Zhaoyang Liu, Hao Wang, Yang Fang, Jianing Liu, Jiaming Hao, Bingkun Jiang, Dapeng Chen, Yucheng Zhao, Zhenyu Wang, Siyu Chen, Rui Qian, Ruihang Xie, Yiming Chen, Shunqi Yao, Yongting Sun, Zhiyi Deng, Mingjie Wang, Liangyu Chen, Tingyu Qu, Sizhe Wang, Shu...
work page 2025
-
[19]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Frund, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017
work page 2017
-
[20]
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Tri- antafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mo...
-
[21]
Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek 11 Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, ...
work page 2022
-
[22]
Chunhui Gu, Chen Sun, David A. Ross, Carl V ondrick, Caroline Pantofaru, Yeqing Li, Sudheen- dra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. Ava: A video dataset of spatio-temporally localized atomic visual actions,
-
[23]
URLhttps://arxiv.org/abs/1705.08421
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked au- toencoders are scalable vision learners, 2021. URL https://arxiv.org/abs/2111.06377
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[26]
Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8450–8460, June 2025
work page 2025
-
[27]
Hsin-Ping Huang, Charles Herrmann, Junhwa Hur, Erika Lu, Kyle Sargent, Austin Stone, Ming-Hsuan Yang, and Deqing Sun. Self-supervised autoflow. InCVPR, 2023
work page 2023
-
[28]
Cotracker3: Simpler and better point tracking by pseudo-labelling real videos
Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proc. arXiv:2410.11831, 2024
-
[29]
Large-scale video classification with convolutional neural networks
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732, 2014
work page 2014
-
[30]
Uni- formerv2: Spatiotemporal learning by arming image vits with video uniformer, 2022
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uni- formerv2: Spatiotemporal learning by arming image vits with video uniformer, 2022
work page 2022
-
[31]
Diving-48: A large-scale fine-grained video dataset for fine-grained action recognition
Yansong Li, Jie Song, Yong Li, Min Liu, Xiaojie Guo, and Zheng-Jun Zha. Diving-48: A large-scale fine-grained video dataset for fine-grained action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5574–5583, 2021
work page 2021
-
[32]
TempCompass: Do Video LLMs Really Understand Videos?
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos?arXiv preprint arXiv: 2403.00476, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019. URLhttps://arxiv.org/abs/1906.03327
-
[34]
Learning action changes by measuring verb-adverb textual relationships
Davide Moltisanti, Frank Keller, Hakan Bilen, and Laura Sevilla-Lara. Learning action changes by measuring verb-adverb textual relationships. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23110–23118, June 2023
work page 2023
- [35]
-
[36]
Lyndsey Pickup, Zheng Pan, Donglai Wei, Yichang Shih, Andrew Zisserman, William T Freeman, and Bernhard Schölkopf. Seeing the arrow of time. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2035–2042, 2014. 12
work page 2035
-
[37]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...
work page 2021
-
[38]
Kanchana Ranasinghe, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Michael S. Ryoo. Self-supervised video transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2874–2884, June 2022
work page 2022
-
[39]
Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube- boundingboxes: A large high-precision human-annotated data set for object detection in video,
-
[40]
URLhttps://arxiv.org/abs/1702.00824
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Broaden your views for self-supervised video learning
Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica P˘atr˘aucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, and Andrew Zisserman. Broaden your views for self-supervised video learning. InProceedings of the IEEE/CVF International Conference on Computer Vision...
work page 2021
-
[42]
Only time can tell: Discovering temporal data for temporal modeling
Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 535–544, January 2021
work page 2021
-
[43]
Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[44]
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, and Mohamed Elhoseiny. Time blindness: Why video-language models can’t see what humans can?, 2025. URL https://arxiv.org/ abs/2505.24867
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Videomae v2: Scaling video masked autoencoders with dual masking, 2023
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking, 2023. URL https://arxiv.org/abs/2303.16727
-
[46]
Next-qa: Next phase of question- answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question- answering to explaining temporal actions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021
work page 2021
-
[47]
Clevrer: Collision events for video representation and reasoning
Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Joshua Tenenbaum, and Antonio Torralba. Clevrer: Collision events for video representation and reasoning. InInternational Conference on Learning Representations (ICLR), 2020
work page 2020
-
[48]
Merlot reserve: Multimodal neural script knowledge through vision and language and sound
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Multimodal neural script knowledge through vision and language and sound. InCVPR, 2022
work page 2022
-
[49]
Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachan- dra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: Towards spatiotemporal awareness in vision language models.arXiv preprint arXiv:2508.02095, 2025
-
[50]
Recurrent Video Masked Autoencoders
Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, and Andrew Zisserman. Recurrent video masked autoencoders, 2025. URL https://arxiv.org/abs/ 2512.13684. 13 A TIME Model Training In this section we provide more extensive details of our model training. We report the numbers that were used to train the TIME model on 250,000 syntheti...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.