pith. machine review for the scientific record. sign in

arxiv: 2512.13684 · v2 · submitted 2025-12-15 · 💻 cs.CV

Recognition: no theorem link

Recurrent Video Masked Autoencoders

Authors on Pith no claims yet

Pith reviewed 2026-05-16 21:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords recurrent video masked autoencodersvideo representation learningself-supervised learningaction classificationparameter efficiencyobject trackinggeometric featureslong temporal sequences
0
0 comments X

The pith

A recurrent masked autoencoder learns strong video features from pixel reconstruction alone and matches larger models with up to 30 times fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recurrent Video Masked Autoencoders train a transformer-based recurrent network to reconstruct masked video pixels across time. The recurrent design aggregates temporal information frame by frame instead of relying on full spatio-temporal attention. This produces a compact encoder that performs competitively with state-of-the-art video models on action classification and tracking while equaling or surpassing image models on geometric tasks. The approach needs no extra losses or distillation and keeps computation linear even for long sequences. A reader should care because it shows recurrence can deliver efficient, general-purpose video representations from a minimal self-supervised objective.

Core claim

RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient generalist encoder that achieves competitive performance with state-of-the-art video models on video-level tasks like action classification and point and object tracking, while matching or exceeding the performance of image models on tasks that require strong geometric and dense spatial features, with up to 30x greater parameter efficiency and stable linear-cost propagation over long temporal horizons.

What carries the argument

Recurrent transformer-based aggregation that processes masked video frames sequentially under a pixel reconstruction loss.

If this is right

  • RVM reaches competitive results on action classification and tracking against larger video models like VideoMAE and V-JEPA.
  • It matches or exceeds image models such as DINOv2 on geometric and dense spatial tasks.
  • Strong performance appears in the small-model regime without knowledge distillation.
  • Recurrent processing yields stable feature propagation at linear cost over long temporal horizons.
  • Ablation studies confirm that the recurrent aggregation and asymmetric masking drive the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The linear scaling could support processing of hour-long videos on modest hardware where quadratic attention fails.
  • The same recurrent masking pattern might transfer to other sequential data such as audio or time-series sensor streams.
  • Because no distillation is required, training pipelines for video self-supervision become simpler to reproduce.
  • Small efficient encoders of this type could enable on-device video understanding without cloud-scale models.

Load-bearing premise

Pixel reconstruction loss combined with recurrent aggregation is enough to learn rich semantic, structural, and motion representations without extra objectives or distillation.

What would settle it

Train RVM and a non-recurrent video MAE baseline on identical data and video lengths; if the recurrent version shows no gain in parameter efficiency or loses stability on sequences beyond 100 frames, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2512.13684 by Andrew Zisserman, Daniel Zoran, Drew A Hudson, Joao Carreira, Nikhil Parthasarathy, Yi Yang.

Figure 1
Figure 1. Figure 1: Normalized task performance is calculated for each task [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: RVM overview. The model encodes source frames from an input video sequentially. Each frame is independently encoded using a vision transformer and the output tokens are aggregated using a transformer-based RNN to produce a sequence of features. See text for full details. During training, a target frame is sampled from a random time gap in the future, masked and encoded using the same ViT encoder. The model… view at source ↗
Figure 3
Figure 3. Figure 3: Evaluation suite. Individual frames and annotations from some of the evaluation tasks in this paper, covering semantic, geometry and motion perception. for large-scale video pretraining and largely represent the current frontier in video self supervised learning. Evaluation Suite. We evaluate RVM on a diverse set of benchmarks ( [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: RVM features are uniquely stable over long timescales. We measure temporal stability of visual features by looking at label propagation (feature correspondence) on videos with increasing numbers of frames from the DAVIS 2017 benchmark. RVM per￾formance decays substantially less for long sequences than other SoTA video and image models. model’s performance on 16 frames. As expected, all mod￾els perform wors… view at source ↗
Figure 6
Figure 6. Figure 6: Detecting a white noise square moving on a white noise background. From top to bottom: input sequence, RVM K-means visualization, an example feature map. Note that each frame independently in the input sequence is a white noise image, and thus image models like DINO or Siam-MAE cannot extract any useful information from these. RVM however can integrate temporal information and "see" the moving square. It i… view at source ↗
Figure 5
Figure 5. Figure 5: PCA and K-means of RVM features unrolled on unseen videos. Despite being trained on only 4 frames the model generalizes to long sequences and unrolls stably over long time horizons. As can be seen, the model learns to extract meaningful features from videos. so that we can more efficiently allocate compute. 6. Conclusion We present Recurrent Video Masked-Autoencoders (RVM), a novel framework that leverages… view at source ↗
Figure 7
Figure 7. Figure 7: KMeans visualization on DAVIS video for various ViT￾L/16 models. Unlike RVM, other models produce noisy feature maps lacking structure and consistency. processing with a simple pixel-level training objective may be sufficient for learning strong visual models from natural video data without the need for extra tricks like strong aug￾mentation, EMA networks, regularizers etc. Future work will explore further… view at source ↗
Figure 8
Figure 8. Figure 8: Temporal Stability in Feature Space. Using K-Means clustering (k = 5) on the car-roundabout sequence, we ob￾serve that RVM (Ours) maintains stable cluster assignments for the moving vehicle and the background throughout the clip. In contrast, VideoMAE v2 and 4DS exhibits significant temporal discontinuity ("flickering"), failing to track the object or background consistently over time. First Frame Early Fr… view at source ↗
Figure 9
Figure 9. Figure 9: Robust Foreground-Background Segmentation. In the goat sequence, RVM effectively disentangles the moving animal from the complex environment. While 4DS suffers from background confusion, merging the object with the scene, RVM produces clean, spatially coherent segments that adhere strictly to object boundaries. DINOv2 segments the object well but fails significantly on the background. 17 [PITH_FULL_IMAGE:… view at source ↗
Figure 10
Figure 10. Figure 10: Motion-Aware Instance Separation. Visualizing clus￾ters for the judo sequence. RVM preserves the structural integrity of semantic parts while separating moving instances from static ones (foreground vs. background human). Notably, it filters out the static background human that DINOv2 fails to distinguish. First Frame Early Frame Middle Frame Last Frame Video DINOv2 VideoMAE VideoMAE v2 V-JEPA 4DS RVM [P… view at source ↗
Figure 12
Figure 12. Figure 12: Intrinsic Dimensionality and Smoothness. We project the top-3 principal components of the frozen features to RGB space. RVM exhibits smooth color gradients that naturally follow the object’s geometry, indicating a representation that is both spatially coherent and semantically meaningful. Conversely, features from other models often appear fragmented, lacking clear separation between the foreground and ba… view at source ↗
Figure 13
Figure 13. Figure 13: Qualitative evaluation on DAVIS-2017. We propagate segmentation masks using nearest-neighbor retrieval (k = 7) from a context queue of 20 frames. RVM (Ours) maintains accurate object boundaries and temporal consistency compared to baselines like VideoMAE and 4DS, which often exhibit mask degradation or flickering. First Frame Early Frame Middle Frame Last Frame DINOv2 VideoMAE 4DS RVM GT DINOv2 VideoMAE 4… view at source ↗
Figure 15
Figure 15. Figure 15: Semantic part propagation on VIP. We visualize the propagation of dense part labels (arm, leg, hair, etc.) using k = 10 nearest neighbors. RVM distinguishes fine-grained semantic parts and tracks them consistently across the video clip, whereas other methods often confuse adjacent parts (e.g., arm vs. torso). 20 [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
read the original abstract

We present Recurrent Video Masked-Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Finally, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based video models. Ablation studies further highlight the factors driving the model's success, with qualitative results showing that RVM learns rich representations of scene semantics, structure, and motion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Recurrent Video Masked Autoencoders (RVM), a video representation learning method that couples an asymmetric masking objective with a transformer-based recurrent neural network, trained solely on pixel reconstruction loss. It claims competitive performance with VideoMAE and V-JEPA on action classification and tracking, matching or exceeding DINOv2 on geometric/dense spatial tasks, up to 30x parameter efficiency in the small-model regime without distillation, and stable linear-cost feature propagation over long temporal horizons.

Significance. If the empirical results hold under scrutiny, RVM would represent a meaningful advance in efficient video encoders by showing that recurrent aggregation can achieve strong semantic, structural, and motion representations without contrastive or predictive auxiliaries, while addressing quadratic attention costs for long sequences and reducing reliance on distillation.

major comments (2)
  1. [Abstract] Abstract: the claim that pixel reconstruction plus recurrence alone yields 'rich representations of scene semantics, structure, and motion' is load-bearing for all downstream performance assertions; the manuscript must supply concrete evidence (e.g., motion-specific probes or an ablation that removes recurrence while keeping capacity fixed) to rule out shortcut solutions based on static cues, as standard video MAEs typically require additional objectives to disentangle dynamics.
  2. [Results] Results section (performance tables): the 30x parameter-efficiency advantage and competitive numbers versus VideoMAE/V-JEPA/DINOv2 must be supported by matched model-size, FLOPs, and training-data comparisons; without these details the efficiency claim cannot be evaluated as load-bearing.
minor comments (1)
  1. [Abstract] Abstract: specify the exact datasets and benchmarks (e.g., Kinetics, Something-Something, DAVIS) used for the reported action classification, tracking, and geometric tasks to allow immediate assessment of scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments below, agreeing that both points warrant additional clarification and evidence in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that pixel reconstruction plus recurrence alone yields 'rich representations of scene semantics, structure, and motion' is load-bearing for all downstream performance assertions; the manuscript must supply concrete evidence (e.g., motion-specific probes or an ablation that removes recurrence while keeping capacity fixed) to rule out shortcut solutions based on static cues, as standard video MAEs typically require additional objectives to disentangle dynamics.

    Authors: We agree that the claim is central and that stronger, more targeted evidence would improve the paper. The current manuscript already contains ablation studies on design factors and qualitative results illustrating motion understanding, but we acknowledge these fall short of the specific controls requested. In the revision we will add (1) a controlled ablation that removes the recurrent component while exactly matching parameter count and capacity, and (2) quantitative motion-specific probes (e.g., optical-flow regression and motion segmentation accuracy) to demonstrate that recurrence contributes beyond static cues. These additions will be placed in the experiments and ablations sections. revision: yes

  2. Referee: [Results] Results section (performance tables): the 30x parameter-efficiency advantage and competitive numbers versus VideoMAE/V-JEPA/DINOv2 must be supported by matched model-size, FLOPs, and training-data comparisons; without these details the efficiency claim cannot be evaluated as load-bearing.

    Authors: We concur that matched comparisons are necessary for the efficiency claim to be credible. Our experiments used small ViT-style backbones with parameter counts aligned to the cited baselines and were pretrained on the same public video corpora (Kinetics-400/600 subsets and ImageNet-derived frames). To make this transparent, the revised results section will include an expanded table reporting exact parameter counts, FLOPs, training data volume, and epoch counts for every compared method, together with a short appendix subsection detailing the matching protocol. This will allow direct evaluation of the reported 30x efficiency gain in the small-model regime. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal without derivations or self-referential claims

full rationale

The paper introduces RVM as a new recurrent video masked autoencoder architecture trained with a pixel reconstruction loss. It contains no equations, derivations, or fitted parameters presented as predictions. All claims rest on external benchmark comparisons (action classification, tracking, geometric tasks) and ablations against prior models like VideoMAE and DINOv2. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify core design choices; the recurrent aggregation and asymmetric masking are presented as design decisions validated empirically. The work is self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; standard transformer and masking assumptions are implicit but not enumerated.

pith-pipeline@v0.9.0 · 5527 in / 1014 out tokens · 23061 ms · 2026-05-16T21:53:11.949409+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 10 internal anchors

  1. [1]

    YouTube-8M: A Large-Scale Video Classification Benchmark

    Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Nat- sev, George Toderici, Balakrishnan Varadarajan, and Sud- heendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark.arXiv preprint arXiv:1609.08675,

  2. [2]

    Learning to see by moving

    Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. InProceedings of the IEEE international conference on computer vision, pages 37–45, 2015. 2

  3. [4]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self- supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025. 1

  4. [5]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016. 4

  5. [6]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers.arXiv preprint arXiv:2106.08254, 2021. 2

  6. [7]

    Mc-jepa: A joint-embedding predictive architecture for self-supervised learning of motion and content features.arXiv preprint arXiv:2307.12698, 2023

    Adrien Bardes, Jean Ponce, and Yann LeCun. Mc-jepa: A joint-embedding predictive architecture for self-supervised learning of motion and content features.arXiv preprint arXiv:2307.12698, 2023. 1

  7. [8]

    Revisiting feature prediction for learning visual rep- resentations from video.Transactions on Machine Learning Research, 2024

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual rep- resentations from video.Transactions on Machine Learning Research, 2024. Featured Certification. 1, 4

  8. [9]

    Possible principles underlying the transformation of sensory messages.Sensory communication, 1(01):217–233, 1961

    Horace B Barlow et al. Possible principles underlying the transformation of sensory messages.Sensory communication, 1(01):217–233, 1961. 1

  9. [10]

    Unifying (machine) vision via counter- factual world modeling.arXiv preprint arXiv:2306.01828,

    Daniel M Bear, Kevin Feigelis, Honglin Chen, Wanhee Lee, Rahul Venkatesh, Klemen Kotar, Alex Durango, and Daniel LK Yamins. Unifying (machine) vision via counter- factual world modeling.arXiv preprint arXiv:2306.01828,

  10. [11]

    Speednet: Learning the speediness in videos

    Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9922–9931, 2020. 2

  11. [12]

    Fully-convolutional siamese networks for object tracking

    Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. InEuropean conference on computer vision, pages 850–865. Springer, 2016. 2

  12. [13]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the International Conference on Computer Vision (ICCV), 2021. 2, 15

  13. [14]

    A short note on the kinetics-700 human action dataset

    Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisser- man. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019. 4, 13

  14. [15]

    Scaling 4D representations

    João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Er- dogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Ke...

  15. [16]

    Learning from one continuous video stream

    Joao Carreira, Michael King, Viorica Patraucean, Dilara Gokay, Catalin Ionescu, Yi Yang, Daniel Zoran, Joseph Hey- ward, Carl Doersch, Yusuf Aytar, et al. Learning from one continuous video stream. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28751–28761, 2024. 2

  16. [17]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020. 2

  17. [18]

    Siamese neural networks: An overview.Arti- ficial neural networks, pages 73–94, 2021

    Davide Chicco. Siamese neural networks: An overview.Arti- ficial neural networks, pages 73–94, 2021. 2

  18. [19]

    On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

    Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches.arXiv preprint arXiv:1409.1259, 2014. 4

  19. [20]

    Out of time: auto- mated lip sync in the wild

    Joon Son Chung and Andrew Zisserman. Out of time: auto- mated lip sync in the wild. InAsian conference on computer vision, pages 251–263. Springer, 2016. 2

  20. [21]

    Scannet: Richly- annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. InCVPR,

  21. [22]

    Unsu- pervised visual representation learning by context prediction

    Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsu- pervised visual representation learning by context prediction. InProceedings of the IEEE international conference on com- puter vision, pages 1422–1430, 2015. 2

  22. [23]

    Long-term recurrent convolutional net- works for visual recognition and description

    Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional net- works for visual recognition and description. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 2625–2634, 2015. 2

  23. [24]

    An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint, 2020

    Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint, 2020. 3, 13, 16

  24. [25]

    Efficient image pre-training with siamese cropped 9 masked autoencoders

    Alexandre Eymaël, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, and Marc Van Droogen- broeck. Efficient image pre-training with siamese cropped 9 masked autoencoders. InEuropean Conference on Computer Vision, pages 348–366. Springer, 2024. 2, 4

  25. [26]

    A-jepa: Joint-embedding predictive architecture can listen.arXiv preprint arXiv:2311.15830, 2023

    Zhengcong Fei, Mingyuan Fan, and Junshi Huang. A-jepa: Joint-embedding predictive architecture can listen.arXiv preprint arXiv:2311.15830, 2023. 1

  26. [27]

    A large-scale study on unsupervised spatiotemporal representation learning

    Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Gir- shick, and Kaiming He. A large-scale study on unsupervised spatiotemporal representation learning. InCVPR, 2021. 15

  27. [28]

    Masked autoencoders as spatiotemporal learners.Advances in neural information processing systems, 35:35946–35958,

    Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. Masked autoencoders as spatiotemporal learners.Advances in neural information processing systems, 35:35946–35958,

  28. [29]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InICCV, 2017. 4, 13, 16

  29. [30]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceed- ings of the IEEE international conference on computer vision, pages 584...

  30. [31]

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapra- gasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Ab- hijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour...

  31. [32]

    Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doer- sch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020. 1

  32. [33]

    S-jepa: Towards seamless cross-dataset transfer through dy- namic spatial attention.arXiv preprint arXiv:2403.11772,

    Pierre Guetschel, Thomas Moreau, and Michael Tangermann. S-jepa: Towards seamless cross-dataset transfer through dy- namic spatial attention.arXiv preprint arXiv:2403.11772,

  33. [34]

    Siamese masked autoencoders.Advances in Neural Information Pro- cessing Systems, 36:40676–40693, 2023

    Agrim Gupta, Jiajun Wu, Jia Deng, and Fei-Fei Li. Siamese masked autoencoders.Advances in Neural Information Pro- cessing Systems, 36:40676–40693, 2023. 2, 3, 4, 7

  34. [35]

    Self- supervised co-training for video representation learning.Ad- vances in neural information processing systems, 33:5679– 5690, 2020

    Tengda Han, Weidi Xie, and Andrew Zisserman. Self- supervised co-training for video representation learning.Ad- vances in neural information processing systems, 33:5679– 5690, 2020. 2

  35. [36]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, 2022. 1, 2, 4, 15

  36. [37]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video gen- eration via transformers.arXiv preprint arXiv:2205.15868,

  37. [38]

    Space-time correspondence as a contrastive random walk.Advances in neural information processing systems, 33:19545–19560,

    Allan Jabri, Andrew Owens, and Alexei Efros. Space-time correspondence as a contrastive random walk.Advances in neural information processing systems, 33:19545–19560,

  38. [39]

    A survey on contrastive self-supervised learning.Technologies, 9(1):2,

    Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning.Technologies, 9(1):2,

  39. [40]

    Towards understanding action recogni- tion

    Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recogni- tion. InProceedings of the IEEE international conference on computer vision, pages 3192–3199, 2013. 5, 14, 15

  40. [41]

    The kinetics human action video dataset, 2017

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017. 16

  41. [42]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,

  42. [43]

    Unsupervised representation learning by sorting sequences

    Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming- Hsuan Yang. Unsupervised representation learning by sorting sequences. InProceedings of the IEEE international confer- ence on computer vision, pages 667–676, 2017. 2

  43. [44]

    Videomamba: State space model for efficient video understanding

    Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. InEuropean conference on computer vision, pages 237–255. Springer, 2024. 2

  44. [45]

    Bridge-prompt: Towards ordinal action understanding in instructional videos

    Muheng Li, Lei Chen, Yueqi Duan, Zhilan Hu, Jianjiang Feng, Jie Zhou, and Jiwen Lu. Bridge-prompt: Towards ordinal action understanding in instructional videos. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19880–19889, 2022. 2

  45. [46]

    Joint-task self-supervised learning for temporal correspondence.Advances in Neural Information Processing Systems, 32, 2019

    Xueting Li, Sifei Liu, Shalini De Mello, Xiaolong Wang, Jan Kautz, and Ming-Hsuan Yang. Joint-task self-supervised learning for temporal correspondence.Advances in Neural Information Processing Systems, 32, 2019. 15

  46. [47]

    Recurrent convolutional neural network for object recognition

    Ming Liang and Xiaolin Hu. Recurrent convolutional neural network for object recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3367–3375, 2015. 2

  47. [48]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 4

  48. [49]

    Videomam- bapro: A leap forward for mamba in video understanding

    Hui Lu, Albert Ali Salah, and Ronald Poppe. Videomam- bapro: A leap forward for mamba in video understanding. arXiv e-prints, pages arXiv–2406, 2024. 2

  49. [50]

    Howto100m: Learning a text-video embedding by watching hundred mil- lion narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred mil- lion narrated video clips. InProceedings of the IEEE/CVF international conference on computer vision, pages 2630– 2640, 2019. 4, 13 10

  50. [51]

    Shuffle and learn: unsupervised learning using temporal order veri- fication

    Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order veri- fication. InEuropean conference on computer vision, pages 527–544. Springer, 2016. 2

  51. [52]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patrick...

  52. [53]

    Predictive information in a sensory popula- tion.Proceedings of the National Academy of Sciences, 112 (22):6908–6913, 2015

    Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information in a sensory popula- tion.Proceedings of the National Academy of Sciences, 112 (22):6908–6913, 2015. 1

  53. [54]

    Videomoco: Contrastive video representation learning with temporally adversarial examples

    Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, and Wei Liu. Videomoco: Contrastive video representation learning with temporally adversarial examples. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11205–11214, 2021. 2

  54. [55]

    Learning features by watching objects move

    Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2701–2710, 2017. 2

  55. [56]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Re- casens Continente, Larisa Markeeva, Dylan Sunil Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, An- toine Miech, Alexandre Fréchette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, ...

  56. [57]

    A benchmark dataset and evaluation methodology for video object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732,

  57. [58]

    Seeing the arrow of time

    Lyndsey C Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William T Freeman. Seeing the arrow of time. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2035–2042, 2014. 2

  58. [59]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar- beláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017. 5, 14, 15

  59. [60]

    Spatiotempo- ral contrastive video representation learning

    Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotempo- ral contrastive video representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6964–6974, 2021. 2

  60. [61]

    Learning transferable visual models from natural language supervision, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 2

  61. [62]

    Predictive coding in the visual cortex: a functional interpretation of some extra- classical receptive-field effects.Nature neuroscience, 2(1):79,

    Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra- classical receptive-field effects.Nature neuroscience, 2(1):79,

  62. [63]

    Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video

    Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. Inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5296–5305, 2017. 4, 13

  63. [64]

    Sensory cortex is optimized for prediction of future input.elife, 7:e31557,

    Yosef Singer, Yayoi Teramoto, Ben DB Willmore, Jan WH Schnupp, Andrew J King, and Nicol S Harper. Sensory cortex is optimized for prediction of future input.elife, 7:e31557,

  64. [65]

    Principles of object perception.Cognitive science, 14(1):29–56, 1990

    Elizabeth S Spelke. Principles of object perception.Cognitive science, 14(1):29–56, 1990. 1

  65. [66]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurélien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et- tinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in percepti...

  66. [67]

    Video- mae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.NeurIPS, 2022

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Video- mae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.NeurIPS, 2022. 1, 2, 4

  67. [68]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. InCVPR,

  68. [69]

    Bevt: Bert pretraining of video transformers

    Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, and Lu Yuan. Bevt: Bert pretraining of video transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14733–14743, 2022. 2

  69. [70]

    Unsupervised learning of visual representations using videos

    Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. InProceedings of the IEEE international conference on computer vision, pages 2794–2802, 2015. 2

  70. [71]

    Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Advances in Neural Information Pro- cessing Systems, 35:3502–3516, 2022

    Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Ro- main Brégier, Yohann Cabon, Vaibhav Arora, Leonid Ants- feld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Advances in Neural Information Pro- cessing Systems, 35:3502–3516, 2022. 2

  71. [72]

    Simmim: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9653–9663, 2022. 2

  72. [73]

    Self-supervised spatiotemporal learning via video clip order prediction

    Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. InProceedings of the IEEE/CVF 11 conference on computer vision and pattern recognition, pages 10334–10343, 2019. 2

  73. [74]

    Rethinking self-supervised correspondence learning: A video frame-level similarity per- spective

    Jiarui Xu and Xiaolong Wang. Rethinking self-supervised correspondence learning: A video frame-level similarity per- spective. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10075–10085, 2021. 2

  74. [75]

    YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark

    Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark.arXiv preprint arXiv:1809.03327, 2018. 5, 15

  75. [76]

    Motionmae: Self-supervised video representation learning with motion- aware masked auto encoders.BMVC Proceedings, 2024

    Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, and Zehuan Yuan. Motionmae: Self-supervised video representation learning with motion- aware masked auto encoders.BMVC Proceedings, 2024. 2

  76. [77]

    Recurring the transformer for video action recognition

    Jiewen Yang, Xingbo Dong, Liujun Liu, Chao Zhang, Jiajun Shen, and Dahai Yu. Recurring the transformer for video action recognition. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14063–14073, 2022. 2

  77. [78]

    Video playback rate perception for self-supervised spatio-temporal representation learning

    Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixi- ang Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6548–6557, 2020. 2

  78. [79]

    Beyond short snippets: Deep networks for video classifica- tion

    Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijaya- narasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classifica- tion. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015. 2

  79. [80]

    Adap- tive temporal encoding network for video instance-level hu- man parsing

    Qixian Zhou, Xiaodan Liang, Ke Gong, and Liang Lin. Adap- tive temporal encoding network for video instance-level hu- man parsing. InProceedings of the 26th ACM international conference on Multimedia, pages 1527–1535, 2018. 14, 15 12 Recurrent Video Masked Autoencoders Supplementary Material

  80. [81]

    Training data details We use a data mixture very similar to the one proposed in [3], consisting of only data from publically available video datasets. However, we do not apply any extra curation to these datasets and critically don’t rely on ImageNet for addi- tional image-level data as so many prior works do: Source Samples Type FPS Apply Curation Weight...

Showing first 80 references.