Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Gijs Dubbelman; Niccolò Cavagnero; Svetlana Orlova

arxiv: 2605.19137 · v1 · pith:BFS466APnew · submitted 2026-05-18 · 💻 cs.CV

Pith reviewed 2026-05-20 10:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video understanding, frozen encoders, image-to-video transfer, recurrent temporal modules, data-efficient pre-training, foundation models, temporal reasoning

The pith

A frozen image foundation model plus a simple recurrent module delivers competitive video understanding without large-scale video pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether video models must be pre-trained from scratch on enormous video collections. It instead freezes an existing image foundation model to supply spatial features and trains only a recurrent temporal module on video streams. Results across multiple video understanding benchmarks indicate that effective temporal reasoning appears without full video pre-training or any updates to the spatial encoder. This setup points to a route for building capable video systems at far lower data and compute cost than current end-to-end approaches.

Core claim

By keeping a pre-trained image foundation model frozen and training solely a recurrent temporal module on video data, competitive results are obtained on video understanding benchmarks, showing that substantial temporal capability can arise without end-to-end video pre-training or fine-tuning the spatial encoder.

What carries the argument

A frozen pre-trained image foundation model used as a fixed spatial encoder together with a trainable recurrent temporal module that processes streaming video frames.
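
To make this division of labour concrete, here is a minimal Python sketch under stated assumptions; it is not the paper's implementation. `frozen_encoder` is a hypothetical stand-in for the image foundation model (it only computes a per-frame mean and standard deviation), and `GatedRecurrentHead` is an illustrative gated update, not GatedMambaMix. The point is structural: the encoder has no trainable parameters, the head owns the only trainable weights, and frames are consumed one at a time.

```python
import math
import random


def frozen_encoder(frame):
    """Hypothetical stand-in for a frozen image model: a fixed map with no trainable weights."""
    mean = sum(frame) / len(frame)
    var = sum((p - mean) ** 2 for p in frame) / len(frame)
    return [mean, math.sqrt(var)]


class GatedRecurrentHead:
    """Illustrative temporal module (an assumption, not GMMix): one gated state update per frame."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # The only trainable parameters in this sketch live here.
        self.w_gate = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        self.state = [0.0] * dim

    def step(self, feat):
        new_state = []
        for h, x, w in zip(self.state, feat, self.w_gate):
            gate = 1.0 / (1.0 + math.exp(-w * x))        # gate value in (0, 1)
            new_state.append((1.0 - gate) * h + gate * x)  # mix memory with the new feature
        self.state = new_state
        return new_state


def run_stream(frames, head):
    """Process frames in arrival order; the encoder is never updated."""
    outputs = []
    for frame in frames:
        feat = frozen_encoder(frame)
        outputs.append(head.step(feat))
    return outputs


if __name__ == "__main__":
    toy_frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]  # each "frame" is a flat list of pixels
    print(run_stream(toy_frames, GatedRecurrentHead(dim=2)))
```

Because the recurrent state is the only thing carried between frames, swapping in a different frozen encoder would only require that its feature dimension match the head.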

If this is right

Video pre-training can proceed with orders-of-magnitude less video data once a strong frozen image encoder is available.
Temporal reasoning can be learned independently once spatial representations are supplied by an image foundation model.
The same image encoder can be reused across many video tasks by retraining only the recurrent module.
Future video foundation models may be constructed by pre-training recurrent modules on top of existing image models rather than training everything jointly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Updating only the temporal module would let practitioners refresh video models quickly when new image encoders appear.
The approach may extend to other sequential domains if strong frozen encoders already exist for their spatial or static components.
Lower training cost could allow video models to be adapted more frequently to new domains or edge devices.

Load-bearing premise

That the spatial features produced by the frozen image model stay sufficiently rich and transferable when paired only with a basic recurrent temporal module and without any adaptation of the image encoder itself.

What would settle it

An experiment showing that the frozen approach falls well short of a fully video-pretrained baseline on a task that demands fine motion discrimination, such as recognizing actions across long untrimmed videos, would indicate the claim does not hold.

Figures

Figures reproduced from arXiv: 2605.19137 by Gijs Dubbelman, Niccolò Cavagnero, Svetlana Orlova.

**Figure 1.** Video Foundation Model vs. Image Foundation Model + Recurrent Head. Comparison of a frozen Video Foundation Model (RVM [25]) vs. a frozen Image Foundation Model (DINOv3 [19]) with a fine-tuned recurrent temporal head, GatedMambaMix (GMMix). DINOv3 achieves similar performance across different tasks without large-scale video pre-training.

**Figure 2.** Image Pre-training vs. Video Pre-training. GMMix temporal module paired with various pre-trained encoders. All encoders are frozen; only GMMix and the readout are trained from scratch. Image pre-trained encoders consistently match or outperform the video pre-trained RVM encoder.

Adjacent table (truncated in the source):

| Model | Size (M) | SSv2 Acc. (↑%) | Waymo mIoU (↑) | PT AJ (↑) | ScanNet AbsRel (↓) | NuScenes RPEtr (↓) | Norm. Avg (↑) |
|---|---|---|---|---|---|---|---|
| RVM-L | 375 | 46.9 | 72.7 | 61.… | | | |

**Figure 3.** Impact of Multi-depth Features. Using tokens from multiple DINOv3 depths (narrow solid bars) consistently improves or matches final-layer-only tokens (wide dashed bars) across all benchmarks and temporal architectures.

Adjacent table (truncated in the source):

| Model | Init | SSv2 Acc. (↑%) | Waymo mIoU (↑) | PT AJ (↑) | ScanNet AbsRel (↓) | NuScenes RPEtr (↓) | NuScenes RPErot (↓) |
|---|---|---|---|---|---|---|---|
| RVM-L | Random | 62.0 | 78.3 | 68.4 | 0.1237 | 38.66 | 0.10 |
| RVM-L | Pre-train | 71.5 (+9.5) | 83.4 (+5.1) | 7… | | | |

**Figure 4.** Data Efficiency on SSv2. DINOv3 + GMMix vs. frozen RVM trained on varying fractions of the SSv2 training set. Dashed line: frozen RVM at 100%. DINOv3 + GMMix surpasses frozen RVM's full-data performance using less than 25% of the training data.

Adjacent body text (truncated in the source): we compare our best model against established video foundation models (Tab. 3). All backbones are frozen; only a…

Original abstract

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Freezing image foundation models and training just a recurrent temporal module can deliver strong video performance without full-scale video pre-training, but the supporting evidence is still high-level.

The main point is that you can build effective video models by freezing a strong image foundation model and only training a recurrent module for temporal processing, cutting down on the need for large video datasets. This could lower the barrier for video research. This approach reuses existing image pre-training to handle spatial features, focusing compute on temporal reasoning instead.

The paper presents this as a feasibility study and shares empirical results on multiple video understanding tasks that support the idea. Releasing the code is a plus for reproducibility and allows others to test the setup. One limitation is the lack of specific numbers or comparisons in the abstract, making it hard to assess the strength of the claims right now. The concern about motion-heavy tasks is valid; if the recurrent module can't bridge gaps in the spatial features for dynamic scenes, the benefits might be limited to simpler cases.

This paper would interest people working on efficient foundation models for video. It serves as an initial exploration rather than a complete solution, pointing to future work on recurrent video models. I think it deserves peer review to get detailed feedback on the experiments and potential improvements. The direction is worth pursuing even if revisions are needed.

Referee Report

2 major / 2 minor

Summary. The manuscript explores a data-efficient paradigm for video understanding: a pre-trained image foundation model is frozen as a spatial encoder while only a recurrent temporal module is trained on streaming video. The central claim, based on empirical findings across multiple video tasks, is that strong temporal performance can emerge without large-scale video pre-training, thereby reducing data and compute costs compared to end-to-end video foundation models.

Significance. If the reported findings hold under scrutiny, the work offers a practical route to lower the substantial costs of video pre-training by reusing mature image representations. The public code release at https://github.com/tue-mps/towards-video-image-frozen is a clear strength that supports reproducibility and further investigation of recurrent video models.

major comments (2)

[Abstract] The claim that 'strong temporal performance can emerge without large-scale video pre-training' is presented without any quantitative results, baselines, ablation details, or dataset sizes. This absence makes it impossible to evaluate whether the empirical findings actually support the feasibility conclusion.
[Approach and Experiments] The load-bearing assumption that frozen image-foundation features remain sufficiently rich and transferable for motion-heavy tasks (deformation, occlusion, viewpoint change) is not tested with targeted ablations. If the recurrent module cannot compensate, the method reduces to a lightweight baseline rather than a viable alternative to video pre-training.

minor comments (2)

The title refers to 'Video Pre-training' yet the method trains only a temporal module on top of a frozen encoder; a brief clarification of terminology would avoid reader confusion.
Figure and table captions should explicitly state the video datasets and task metrics used so that the empirical claims can be assessed at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the presentation of our results and assumptions.

Referee: [Abstract] The claim that 'strong temporal performance can emerge without large-scale video pre-training' is presented without any quantitative results, baselines, ablation details, or dataset sizes. This absence makes it impossible to evaluate whether the empirical findings actually support the feasibility conclusion.

Authors: We agree that the abstract, as currently written, is high-level and does not include quantitative highlights. While the full manuscript provides detailed results, baselines, ablations, and dataset sizes in the Experiments section, we will revise the abstract to incorporate concise quantitative statements (e.g., relative performance on standard benchmarks and data/compute savings) to better support the feasibility claim for readers.
Revision: yes

Referee: [Approach and Experiments] The load-bearing assumption that frozen image-foundation features remain sufficiently rich and transferable for motion-heavy tasks (deformation, occlusion, viewpoint change) is not tested with targeted ablations. If the recurrent module cannot compensate, the method reduces to a lightweight baseline rather than a viable alternative to video pre-training.

Authors: We acknowledge that our current experiments, while covering video tasks that involve motion, deformation, occlusion, and viewpoint variation, do not include dedicated ablations that isolate these factors. We will add targeted ablations in the revised manuscript that systematically vary these conditions to demonstrate the contribution of the recurrent temporal module and confirm that frozen image features remain effective when paired with it.
Revision: yes

Circularity Check

0 steps flagged

Empirical feasibility study with no derivation chain or circular reductions

The paper is framed as an empirical exploration of reusing frozen image foundation models plus a simple recurrent temporal module for video tasks, without large-scale video pre-training. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-referential definitions by construction. Central claims rest on reported task performances across video understanding benchmarks, which are externally verifiable through experiments rather than internally forced. This is a self-contained empirical study against external benchmarks with no load-bearing self-citations or ansatz smuggling identified in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that image-model spatial features transfer directly to video when a recurrent temporal module is added; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Image foundation models provide powerful spatial representations that are transferable to video tasks when combined with a recurrent temporal module.
This premise is invoked in the abstract as the justification for freezing the image model and training only the temporal component.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited theorem: unclear
Paper passage: "Our framework decouples spatial and temporal learning... frozen image encoder... recurrent temporal module... GatedMambaMix (GMMix)"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited theorem: unclear
Paper passage: "strong temporal performance can emerge without large-scale video pre-training"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 9 internal anchors

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,
[2] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.
[3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
[4] João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross G... (arXiv, 2024)
[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[8] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842...
[9] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3761, 2022.
[10] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[11] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
[12] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491,
[13] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In European Conference on Computer Vision, pages 237–255. Springer, 2024.
[14] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[15] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... DINOv2: Learning Robust Visual Features without Supervision.
[16] Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023.
[17] Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, and Razvan Pascanu. Trecvit: A recurrent video transformer. arXiv preprint arXiv:2412.14294,
[18] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[19] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Coupri... (arXiv, 2025)
[20] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
[21] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022.
[22] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357, 2021.
[23] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
[24] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753,
[25] Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A. Hudson, João Carreira, and Andrew Zisserman. Recurrent video masked autoencoders. arXiv preprint arXiv:2512.13684,
[26] Supplementary Material. The offline evaluation protocol, including readout architectures and training procedures, follows [4, 25]. In this supplementary material, we explain the streaming evaluation, which reflects the real-world scenario of receiving video f...
[27] Streaming Tasks. Table 4 provides an overview of each task, including the training dataset, loss function, evaluation metric, and readout head parameters. The readout heads are based on the cross-attention architecture from [25], adapted to operate on single-frame tokens ($N$ tokens per frame) instead of the full spatio-temporal sequence ($T \times N$ tokens). The en...
[28] Training Settings. All tasks share the same training configuration unless stated otherwise. We use AdamW [14] with $(\beta_1, \beta_2) = (0.9, 0.999)$, weight decay $10^{-4}$, a cosine learning rate schedule decaying to $\eta_{\min} = 10^{-7}$, and linear warmup. Training uses mixed precision (bf16). Our protocol is based on 4DS [4], which used 40K steps for frozen training (only the ...
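
The training settings in [28] can be illustrated with a small, dependency-free Python sketch of a linear-warmup plus cosine-decay learning rate. The AdamW betas, weight decay, and the floor `eta_min = 1e-7` come from the text above; the peak learning rate, the warmup length, and the total step count are not given in this excerpt, so `base_lr`, `warmup_steps`, and `total_steps` are hypothetical arguments, and the exact warmup shape used by the authors may differ.

```python
import math

# Values stated in the text; everything else below is an assumption.
ADAMW_CONFIG = {"betas": (0.9, 0.999), "weight_decay": 1e-4}


def lr_at(step, total_steps, base_lr, warmup_steps, eta_min=1e-7):
    """Learning rate at a given step: linear warmup, then cosine decay to eta_min."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / span, 1.0)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    for s in (0, 500, 1000, 20000, 40000):
        print(s, lr_at(s, total_steps=40000, base_lr=1e-3, warmup_steps=1000))
```

In a real training loop this function would typically be wrapped in the framework's own scheduler; it is written as a plain function here so the shape of the schedule is easy to inspect.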
[29] Evaluation Metrics. Top-1 accuracy (SSv2): standard classification accuracy on the validation set. mIoU (Waymo): mean Intersection over Union between predicted and ground-truth bounding boxes, averaged over all objects and frames. Average Jaccard (PT): following the Perception Test benchmark [16], AJ is defined as the average of Jaccard values at position...
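
As a rough illustration of the mIoU metric described in [29], the Python sketch below computes box IoU and averages it over all objects and frames. The `(x_min, y_min, x_max, y_max)` box format and the assumption that predictions and ground truth are already aligned object by object are choices made for this example; the paper's evaluation code may represent boxes differently.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred, gt):
    """Average IoU over every (frame, object) pair.

    pred and gt are lists of frames; each frame is a list of boxes,
    with pred[f][k] assumed to correspond to gt[f][k].
    """
    scores = [
        box_iou(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    pred = [[(0, 0, 2, 2)], [(0, 0, 1, 1)]]
    gt = [[(1, 1, 3, 3)], [(0, 0, 1, 1)]]
    print(mean_iou(pred, gt))
```

Averaging over the flattened list of pairs weights every object instance equally, which matches the phrase "averaged over all objects and frames"; averaging per frame first would give a different number when frames contain different object counts.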

[1] [1]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- ton. Layer normalization.arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learn- ing visual representations from video.arXiv preprint arXiv:2404.08471, 2024. 1, 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 2, 5

work page 2020

[4] [4]

Jo ˜ao Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdo- gan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Mo- ing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro V ´elez, Luisa Polan ´ıa, Luke Friedman, Chris Duvar- ney, Ross G...

work page arXiv 2024

[5] [5]

Learning phrase representations using RNN encoder-decoder for statistical machine translation

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. InPro- ceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. 3

work page 2014

[6] [6]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 2, 5

work page 2017

[7] [7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2010

[8] [8]

The” something something” video database for learning and evaluating visual common sense

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The” something something” video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on com- puter vision, pages 5842...

work page 2017

[9] [9]

Kubric: A scalable dataset generator

Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapra- gasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3761, 2022. 5, 1, 2

work page 2022

[10] [10]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023. 2, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Masked autoencoders are scalable vision learners.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000– 16009, 2022

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000– 16009, 2022. 2

work page 2022

[12] [12]

Multi-task learning using uncertainty to weigh losses for scene geome- try and semantics

Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geome- try and semantics. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491,

work page

[13] [13]

Videomamba: State space model for efficient video understanding

Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. InEuropean Conference on Computer Vision, pages 237–255. Springer, 2024. 2

work page 2024

[14] [14]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017

[15] [15]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Je- gou, Julien Mairal, ...

[16]

Perception test: A diagnostic benchmark for multimodal video models

Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023.

[17]

Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, and Razvan Pascanu. Trecvit: A recurrent video transformer. arXiv preprint arXiv:2412.14294,

[18]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[19]

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Coupri...

[20]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.

[21]

VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022.

[22]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357, 2021.

[23]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[24]

On the continuity of rotation representations in neural networks

Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753,

[25]

Recurrent Video Masked Autoencoders

Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A. Hudson, João Carreira, and Andrew Zisserman. Recurrent video masked autoencoders. arXiv preprint arXiv:2512.13684,


Supplementary Material

The offline evaluation protocol, including readout architectures and training procedures, follows [4, 25]. In this supplementary material, we explain the streaming evaluation, which reflects the real-world scenario of receiving video f...


Streaming Tasks

Table 4 provides an overview of each task, including the training dataset, loss function, evaluation metric, and readout head parameters. The readout heads are based on the cross-attention architecture from [25], adapted to operate on single-frame tokens (N tokens per frame) instead of the full spatio-temporal sequence (T×N tokens). The en...


Training Settings

All tasks share the same training configuration unless stated otherwise. We use AdamW [14] with $(\beta_1, \beta_2) = (0.9, 0.999)$, weight decay $10^{-4}$, a cosine learning rate schedule decaying to $\eta_{\min} = 10^{-7}$, and linear warmup. Training uses mixed precision (bf16). Our protocol is based on 4DS [4], which used 40K steps for frozen training (only the ...
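
As a rough illustration of the learning rate schedule described in the training settings, the following Python sketch computes a per-step learning rate with linear warmup followed by cosine decay to a floor of 10^-7. The function name `lr_at_step`, the peak learning rate, and the warmup length are assumptions made for illustration and are not stated in this section; the 40K total step count is borrowed from the 4DS frozen-training setting mentioned in the text.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_steps, eta_min=1e-7):
    """Learning rate at `step`: linear warmup, then cosine decay to eta_min.

    base_lr and warmup_steps are hypothetical values chosen by the caller;
    only eta_min = 1e-7 comes from the training settings in the text.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return eta_min + (base_lr - eta_min) * cosine


if __name__ == "__main__":
    # Assumed example values: peak LR 1e-4 and 1,000 warmup steps.
    for s in (0, 999, 1000, 20500, 40000):
        print(s, lr_at_step(s, total_steps=40000, base_lr=1e-4, warmup_steps=1000))
```

In a real training loop the returned value would be written into the optimizer's parameter groups at each step; that wiring is omitted here because it depends on the framework in use.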


Evaluation Metrics

Top-1 accuracy (SSv2). Standard classification accuracy on the validation set.

mIoU (Waymo). Mean Intersection over Union between predicted and ground-truth bounding boxes, averaged over all objects and frames.

Average Jaccard (PT). Following the Perception Test benchmark [16], AJ is defined as the average of Jaccard values at position...
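
The first two metrics above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' evaluation code: the function names, the `(x_min, y_min, x_max, y_max)` box format, and the assumption that predictions and ground truth arrive as already-matched flat lists (one entry per object per frame) are all choices made for this example.

```python
def top1_accuracy(predicted_labels, true_labels):
    """Fraction of samples whose predicted class equals the true class."""
    if not true_labels:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_boxes, gt_boxes):
    """Average IoU over matched (prediction, ground truth) pairs.

    Assumes both lists are flattened over all objects and frames and
    aligned index by index.
    """
    pairs = list(zip(pred_boxes, gt_boxes))
    if not pairs:
        return 0.0
    return sum(box_iou(p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    print(top1_accuracy([1, 2, 3, 4], [1, 2, 0, 4]))
    print(mean_iou([(0, 0, 2, 2), (0, 0, 1, 1)], [(0, 0, 2, 2), (5, 5, 6, 6)]))
```

Average Jaccard is left out of the sketch because its definition is cut off in the text above.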
