Seeing Fast and Slow: Learning the Flow of Time in Videos
Pith reviewed 2026-05-09 21:52 UTC · model grok-4.3
The pith
Self-supervised models learn to perceive and control the flow of time in videos by detecting speed changes and estimating playback speeds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details.
What carries the argument
self-supervised temporal reasoning models trained to detect speed changes and estimate playback speed from multimodal and temporal cues in videos
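The abstract does not spell out the pretext task beyond naming multimodal cues and temporal structure. Prior self-supervised work on speediness, such as SpeedNet [6] and pace prediction [53], suggests one concrete form: resample the same clip at different temporal strides and train a classifier to recover the stride. A minimal sketch of that family of tasks follows; the backbone, class set, and all hyperparameters are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a pace-prediction pretext task: labels come for free
# by resampling the same video at different temporal strides. The backbone and
# all hyperparameters here are placeholders, not the paper's implementation.

SPEED_CLASSES = [1, 2, 4, 8]  # candidate playback-speed factors

class SpeedClassifier(nn.Module):
    def __init__(self, feat_dim=256, n_classes=len(SPEED_CLASSES)):
        super().__init__()
        # Stand-in video encoder: any spatiotemporal backbone works here.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(32, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, clip):  # clip: (B, 3, T, H, W)
        return self.head(self.encoder(clip))

def make_pretext_batch(video, clip_len=16):
    """Sample one clip per speed class by striding through the frames.

    video: (3, T, H, W) tensor; the label is the index of the stride used,
    so no manual annotation is ever needed.
    """
    clips, labels = [], []
    for label, stride in enumerate(SPEED_CLASSES):
        start = torch.randint(0, video.shape[1] - clip_len * stride, (1,)).item()
        idx = torch.arange(start, start + clip_len * stride, stride)
        clips.append(video[:, idx])
        labels.append(label)
    return torch.stack(clips), torch.tensor(labels)

model = SpeedClassifier()
video = torch.randn(3, 256, 64, 64)   # toy stand-in for a real video
clips, labels = make_pretext_batch(video)
loss = nn.functional.cross_entropy(model(clips), labels)
loss.backward()
```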
If this is right
- A large high-quality slow-motion dataset becomes available for training without manual labeling.
- Video generators can be conditioned on a target playback speed, so the same scene can be rendered fast or slow on demand (one possible conditioning scheme is sketched after this list).
- Low-frame-rate input can be turned into high-frame-rate output that recovers fine-grained motion details.
- Temporal forensics tasks such as spotting speed tampering become feasible with the same learned speed detectors.
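The abstract does not say how a target speed would enter the generator. One common conditioning pattern, assumed here rather than taken from the paper, is to embed the scalar speed factor the way diffusion timesteps are embedded and inject the combined vector into the denoiser. All names and the toy trunk below are hypothetical.

```python
import math
import torch
import torch.nn as nn

# Hypothetical sketch: one common way to condition a video diffusion model on
# a scalar playback speed is to embed it like a timestep and sum the
# embeddings. Nothing here is taken from the paper's architecture.

def sinusoidal_embedding(x, dim=128):
    """Map a scalar (speed factor or diffusion step) to a sinusoidal vector."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = x[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class SpeedConditionedDenoiser(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.cond_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        # Placeholder denoising trunk; a real model would be a video U-Net or DiT.
        self.trunk = nn.Conv3d(3, 3, kernel_size=3, padding=1)
        self.to_scale = nn.Linear(dim, 3)

    def forward(self, noisy_video, t, speed):
        # Combine diffusion-step and speed embeddings into one conditioning vector.
        cond = self.cond_mlp(
            sinusoidal_embedding(t) + sinusoidal_embedding(torch.log2(speed))
        )
        scale = self.to_scale(cond)[:, :, None, None, None]  # broadcast over T, H, W
        return self.trunk(noisy_video) * (1 + scale)

model = SpeedConditionedDenoiser()
x = torch.randn(2, 3, 8, 32, 32)   # (B, C, T, H, W) noisy latents
t = torch.rand(2)                  # diffusion timesteps in [0, 1]
speed = torch.tensor([0.25, 8.0])  # target playback-speed factors
out = model(x, t, speed)           # same shape as x
```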
Where Pith is reading between the lines
- The same speed-estimation signal could be used to flag edited or synthetic video that contains inconsistent timing.
- World models for robotics or simulation might improve if they explicitly represent how events unfold at different timescales.
- Extending the approach to audio-visual alignment could let models learn consistent speed across sound and image streams.
Load-bearing premise
Multimodal cues and temporal structure inside ordinary videos are rich enough to let a model reliably spot speed changes and estimate playback speed even when the source videos are noisy and uncurated.
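If the premise holds, curation reduces to running the learned estimator over noisy candidates and keeping only confident slow-motion predictions. A minimal sketch of such a filter, where `estimate_speed` is a hypothetical stand-in for the trained model and both thresholds are illustrative:

```python
# Hypothetical curation loop: keep clips whose estimated playback speed is
# well below real time (i.e., likely genuine slow motion) with high
# confidence. `estimate_speed` stands in for the paper's trained estimator;
# the thresholds are illustrative, not reported values.

def curate_slow_motion(candidate_clips, estimate_speed,
                       max_speed=0.5, min_confidence=0.9):
    kept = []
    for clip in candidate_clips:
        speed, confidence = estimate_speed(clip)  # e.g., (0.25, 0.95)
        if speed <= max_speed and confidence >= min_confidence:
            kept.append(clip)
    return kept
```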
What would settle it
Training the same generation and super-resolution models on a size-matched random sample of ordinary video, instead of the curated slow-motion collection, and finding no measurable loss in temporal coherence or fine-grained detail.
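Operationally this is a size-matched data ablation. A schematic of the comparison, with `train` and `evaluate` as hypothetical stand-ins for the paper's training pipeline and its temporal-quality metrics (e.g., FVD [49] or FloLPIPS [13]):

```python
import random

# Schematic falsification test, all callables hypothetical: if the curated
# slow-motion data confers no measurable advantage over size-matched ordinary
# video, the curation premise fails.

def data_ablation(curated_slowmo, ordinary_pool, n, train, evaluate):
    model_ordinary = train(random.sample(ordinary_pool, n))   # baseline
    model_curated = train(random.sample(curated_slowmo, n))   # treatment
    # evaluate() should return a scalar temporal-quality score (higher is
    # better); a delta near zero would settle the question against the premise.
    return evaluate(model_curated) - evaluate(model_ordinary)
```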
read the original abstract
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.
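The abstract describes temporal super-resolution only at this level. At its simplest, the task interleaves synthesized intermediate frames between observed ones; the sketch below uses naive linear blending as a baseline stand-in, where the paper's learned model would replace `midpoint`.

```python
import torch

# Naive temporal super-resolution baseline: double the frame rate by inserting
# a synthesized frame between each adjacent pair. Linear blending is only a
# stand-in; the paper's learned model would replace `midpoint`.

def midpoint(frame_a, frame_b):
    return 0.5 * (frame_a + frame_b)  # placeholder for a learned interpolator

def double_fps(frames):  # frames: (T, C, H, W)
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.extend([a, midpoint(a, b)])
    out.append(frames[-1])
    return torch.stack(out)  # (2T - 1, C, H, W)

video = torch.rand(8, 3, 64, 64)  # 8-frame toy clip
slowmo = double_fps(video)        # 15 frames at twice the rate
```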
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to treat time as a learnable visual concept in videos by developing self-supervised models that detect speed changes and estimate playback speed from multimodal and temporal cues. These models are then used to curate the largest slow-motion video dataset from noisy in-the-wild sources, which in turn supports new models for speed-conditioned video generation and temporal super-resolution.
Significance. If the self-supervised speed estimation step is reliable, the work would provide a valuable large-scale slow-motion dataset and demonstrate new capabilities for temporal control in video models, potentially advancing generative video methods, temporal forensics, and richer world models that reason about event timing. The self-supervised curation approach is a notable strength if validated.
major comments (1)
- [Abstract] The self-supervised pipeline for speed change detection and playback speed estimation is the load-bearing step for curating the claimed largest slow-motion dataset and all downstream results. The abstract describes the high-level approach but provides no quantitative validation, error analysis, ablations on real-world confounders (e.g., camera shake, cuts, audio desync), or comparisons to baselines, so the reliability of the labels cannot be assessed.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address the major comment on the abstract below and have made revisions to improve clarity regarding the reliability of the self-supervised pipeline.
read point-by-point responses
- Referee: [Abstract] The self-supervised pipeline for speed change detection and playback speed estimation is the load-bearing step for curating the claimed largest slow-motion dataset and all downstream results. The abstract describes the high-level approach but provides no quantitative validation, error analysis, ablations on real-world confounders (e.g., camera shake, cuts, audio desync), or comparisons to baselines, so the reliability of the labels cannot be assessed.
Authors: We agree that the abstract, in its original form, emphasizes the high-level approach without embedding specific quantitative results, which limits immediate assessment of the pipeline's reliability. The body of the manuscript contains the requested quantitative validation, including accuracy and error metrics for speed change detection and playback speed estimation, ablations addressing real-world factors such as camera motion and temporal discontinuities, and comparisons against baselines. To directly address this point, we have revised the abstract to incorporate key quantitative highlights from our experiments while preserving its concise nature. This change makes the load-bearing role of the self-supervised step more transparent to readers.
Revision: yes
Circularity Check
No significant circularity; self-supervised pipeline is data-driven and self-contained
full rationale
The paper presents a self-supervised learning pipeline that exploits naturally occurring multimodal and temporal cues in videos to train models for speed change detection and playback speed estimation. These models are then applied to curate a slow-motion dataset from in-the-wild sources, which in turn supports downstream tasks such as speed-conditioned generation and temporal super-resolution. No equations, derivations, or self-citations appear that reduce any claimed prediction or result to its own inputs by construction. The approach relies on external data patterns rather than on fitted parameters renamed as predictions or on ansatzes smuggled in from prior self-work. This is the standard case of an honest non-finding: the derivation chain does not collapse into tautology.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network hyperparameters and training settings
axioms (1)
- Domain assumption: Videos contain sufficient multimodal and temporal structure to support self-supervised inference of playback speed changes.
Reference graph
Works this paper leans on
- [1] Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-Reason1: From physical common sense to embodied reasoning. arXiv:2503.15558, 2025.
- [2] Piyush Bagad, Makarand Tapaswi, and Cees GM Snoek. Test of time: Instilling video-language models with a sense of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2503–2516, 2023.
- [3] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. AC3D: Analyzing and improving 3D camera control in video diffusion transformers. In CVPR, 2025.
- [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv:2502.13923, 2025.
- [5] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
- [6] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. SpeedNet: Learning the speediness in videos. In CVPR, 2020.
- [7] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127, 2023.
- [8] Tim Brooks and Jonathan T Barron. Learning to synthesize motion blur. In CVPR, 2019.
- [9] Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. In CVPR, 2025.
- [10] Jiaben Chen and Huaizu Jiang. SportsSloMo: A new benchmark and baselines for human-centric video frame interpolation. arXiv:2308.16876, 2023.
- [11] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. In CVPR, 2024.
- [12] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025.
- [13] Duolikun Danier, Fan Zhang, and David Bull. FloLPIPS: A bespoke video quality metric for frame interpolation. In Picture Coding Symposium, 2022.
- [14] Duolikun Danier, Fan Zhang, and David Bull. LDMVFI: Video frame interpolation with latent diffusion models. In AAAI, 2024.
- [15] Xi Ding and Lei Wang. Do language models understand time? In Companion Proceedings of the ACM on Web Conference 2025, pages 1855–1868, 2025.
- [16] Michael Dorkenwald, Fanyi Xiao, Biagio Brattoli, Joseph Tighe, and Davide Modolo. SCVRL: Shuffled contrastive video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4132–4141, 2022.
- [17] Yang Du, Yuqi Liu, and Qin Jin. Reversed in time: A novel temporal-emphasized benchmark for cross-modal video-text retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 5260–5269, 2024.
- [18] Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J Black, and Xuaner Zhang. Explorative inbetweening of time and space. In ECCV, 2024.
- [19] Xiangbo Gao, Mingyang Wu, Siyuan Yang, Jiongze Yu, Pardis Taghavi, Fangzhou Lin, and Zhengzhong Tu. The pulse of motion: Measuring physical frame rate from visual dynamics. arXiv:2603.14375, 2026.
- [20] Amir Ghodrati, Efstratios Gavves, and Cees GM Snoek. Video time: Properties, encoders and evaluation. arXiv:1807.06980, 2018.
- [21] Chenlong He, Qi Zheng, Ruoxi Zhu, Xiaoyang Zeng, Yibo Fan, and Zhengzhong Tu. COVER: A comprehensive video quality evaluator. In CVPR W, 2024.
- [22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
- [23] Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y Wang, Joan Lasenby, and Chun-Hao Huang. SpacetimePilot: Generative rendering of dynamic scenes across space and time. arXiv:2512.25075, 2025.
- [24] Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. Real-time intermediate flow estimation for video frame interpolation. In ECCV, 2022.
- [25] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.
- [26] Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffusion models. In CVPR, 2024.
- [27] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In CVPR, 2018.
- [28] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. VACE: All-in-one video creation and editing. In ICCV, 2025.
- [29] Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. In ICCV, 2025.
- [30] Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. In ICCV, 2017.
- [31] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 667–676, 2017.
- [32] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv:2305.06355, 2023.
- [33] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In EMNLP, 2024.
- [34] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017.
- [35] Rundong Luo, Matthew Wallingford, Ali Farhadi, Noah Snavely, and Wei-Chiu Ma. Beyond the frame: Generating 360° panoramic videos from perspective videos. In ICCV, 2025.
- [36] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. UVG dataset: 50/120fps 4K sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, pages 297–302, 2020.
- [37] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
- [38] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, 2017.
- [39] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. arXiv:2407.02371, 2024.
- [40] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In CVPR, 2017.
- [41] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
- [42] Lyndsey C Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William T Freeman. Seeing the arrow of time. In CVPR, 2014.
- [43] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. FILM: Frame interpolation for large motion. In ECCV, 2022.
- [44] Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. XVFI: Extreme video frame interpolation. In ICCV, 2021.
- [45] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. MovieChat: From dense token to sparse memory for long video understanding. In CVPR, 2024.
- [46] Tomás Soucek and Jakub Lokoc. TransNet V2: An effective deep network architecture for fast shot transition detection. In ACM MM, 2024.
- [47] Verena Steinhof, Anna Schroeger, Roman Liepelt, and Laura Sperl. Time and video speed perception: A comprehensive investigation of the relation between estimated video speed, clip duration and original duration. Cognitive Research: Principles and Implications, 2025.
- [48] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In CVPR, 2017.
- [49] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In ICLR W, 2019.
- [51] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv:2503.20314, 2025.
- [52] Angtian Wang, Haibin Huang, Zhiyuan Fang, Yiding Yang, and Chongyang Ma. ATI: Any trajectory instruction for controllable video generation. arXiv:2505.22944, 2025.
- [53] Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. In ECCV, 2020.
- [54] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling video masked autoencoders with dual masking. In CVPR, 2023.
- [55] Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, and Steven M Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation. In ICLR, 2024.
- [56] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv:2307.06942, 2023.
- [57] Yihan Wang, Lahav Lipson, and Jia Deng. SEA-RAFT: Simple, efficient, accurate RAFT for optical flow. In ECCV, 2024.
- [58] Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, and Gordon Wetzstein. BulletTime: Decoupled control of time and camera pose for video generation. arXiv:2512.05076, 2025.
- [59] Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models. Advances in Neural Information Processing Systems, 36:20729–20749, 2023.
- [60] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A unified and flexible motion controller for video generation. In SIGGRAPH, 2024.
- [61] Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In CVPR, 2018.
- [62] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming Slow-Mo: Fast and accurate one-stage space-time video super-resolution. In CVPR, 2020.
- [63] Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models. arXiv:2506.03340, 2025.
- [64] Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixiang Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. In CVPR, 2020.
- [65] Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, and Dacheng Tao. DPText-DETR: Towards better scene text detection with dynamic points in transformer. In AAAI, 2023.
- [66] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
- [67] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv:2503.21755, 2025.
- [68] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv:2504.10479, 2025.