arxiv: 2605.19137 · v1 · pith:BFS466APnew · submitted 2026-05-18 · 💻 cs.CV

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Pith reviewed 2026-05-20 10:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords video understanding · frozen encoders · image-to-video transfer · recurrent temporal modules · data-efficient pre-training · foundation models · temporal reasoning

The pith

A frozen image foundation model plus a simple recurrent module delivers competitive video understanding without large-scale video pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether video models must be pre-trained from scratch on enormous video collections. It instead freezes an existing image foundation model to supply spatial features and trains only a recurrent temporal module on video streams. Results across multiple video understanding benchmarks indicate that effective temporal reasoning appears without full video pre-training or any updates to the spatial encoder. This setup points to a route for building capable video systems at far lower data and compute cost than current end-to-end approaches.

Core claim

By keeping a pre-trained image foundation model frozen and training solely a recurrent temporal module on video data, competitive results are obtained on video understanding benchmarks, showing that substantial temporal capability can arise without end-to-end video pre-training or fine-tuning the spatial encoder.

What carries the argument

A frozen pre-trained image foundation model used as a fixed spatial encoder together with a trainable recurrent temporal module that processes streaming video frames.
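
The split described above, a fixed spatial encoder feeding a trainable recurrent module one frame at a time, can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: `frozen_encoder` is a fixed linear map standing in for an image foundation model, and `GatedRecurrentHead` is a generic per-channel gated recurrence, not the paper's GMMix module, whose internals this page does not describe.

```python
import math
import random


def frozen_encoder(frame, weights):
    """Stand-in for a frozen image encoder: a fixed linear map, never updated."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


class GatedRecurrentHead:
    """Toy temporal module (hypothetical; not the paper's GMMix)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Per-channel parameters; in a real setup only these would be trained.
        self.gate_w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        self.cand_w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        self.state = [0.0] * dim

    def step(self, feat):
        """Update the hidden state from one frame's features and return it."""
        new_state = []
        for h, f, gw, cw in zip(self.state, feat, self.gate_w, self.cand_w):
            z = 1.0 / (1.0 + math.exp(-gw * f))  # update gate in (0, 1)
            cand = math.tanh(cw * f)             # candidate state in (-1, 1)
            new_state.append((1.0 - z) * h + z * cand)
        self.state = new_state
        return new_state


def run_stream(frames, weights, head):
    """Process frames one at a time, carrying only the recurrent state."""
    outputs = []
    for frame in frames:
        feat = frozen_encoder(frame, weights)  # spatial features, fixed
        outputs.append(head.step(feat))        # temporal mixing, trainable
    return outputs


# Usage: 5 frames of 4 "pixels", encoded into 3 feature channels.
rng = random.Random(1)
enc_weights = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]
video = [[rng.uniform(0.0, 1.0) for _ in range(4)] for _ in range(5)]
states = run_stream(video, enc_weights, GatedRecurrentHead(dim=3))
```

The point of the sketch is the data flow: the encoder sees each frame independently, and all cross-frame information lives in the recurrent state, so memory does not grow with video length.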

If this is right

  • Video pre-training can proceed with orders-of-magnitude less video data once a strong frozen image encoder is available.
  • Temporal reasoning can be learned independently once spatial representations are supplied by an image foundation model.
  • The same image encoder can be reused across many video tasks by retraining only the recurrent module.
  • Future video foundation models may be constructed by pre-training recurrent modules on top of existing image models rather than training everything jointly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Updating only the temporal module would let practitioners refresh video models quickly when new image encoders appear.
  • The approach may extend to other sequential domains if strong frozen encoders already exist for their spatial or static components.
  • Lower training cost could allow video models to be adapted more frequently to new domains or edge devices.

Load-bearing premise

That the spatial features produced by the frozen image model stay sufficiently rich and transferable when paired only with a basic recurrent temporal module and without any adaptation of the image encoder itself.

What would settle it

An experiment showing that the frozen approach falls well short of a fully video-pretrained baseline on a task that demands fine motion discrimination, such as recognizing actions across long untrimmed videos, would indicate the claim does not hold.

Figures

Figures reproduced from arXiv: 2605.19137 by Gijs Dubbelman, Niccolò Cavagnero, Svetlana Orlova.

Figure 1: Video Foundation Model vs. Image Foundation Model + Recurrent Head. Comparison of a frozen Video Foundation Model (RVM [25]) vs. a frozen Image Foundation Model (DINOv3 [19]) with a fine-tuned recurrent temporal head, GatedMambaMix (GMMix). DINOv3 achieves similar performance across different tasks without large scale video pre-training.
[…] an unprecedented level of capability. Trained on billions of images…
Figure 2: Image Pre-training vs. Video Pre-training. GMMix temporal module paired with various pre-trained encoders. All encoders are frozen, only GMMix and the readout are trained from scratch. Image pre-trained encoders consistently match or outperform the video pre-trained RVM encoder.

Model   Size (M)   SSv2 Acc. (↑%)   Waymo mIoU (↑)   PT AJ (↑)   ScanNet AbsRel (↓)   NuScenes RPEtr (↓)   Norm. Avg (↑)
RVM-L   375        46.9             72.7             61.…
Figure 3: Impact of Multi-depth Features. Using tokens from multiple DINOv3 depths (narrow solid bars) consistently improves or matches final-layer-only tokens (wide dashed bars) across all benchmarks and temporal architectures.

Model   Init        SSv2 Acc. (↑%)   Waymo mIoU (↑)   PT AJ (↑)   ScanNet AbsRel (↓)   NuScenes RPEtr (↓)   NuScenes RPErot (↓)
RVM-L   Random      62.0             78.3             68.4        0.1237               38.66                0.10
RVM-L   Pre-train   71.5 (+9.5)      83.4 (+5.1)      7…
Figure 4: Data Efficiency on SSv2. DINOv3 + GMMix vs. frozen RVM trained on varying fractions of the SSv2 training set. Dashed line: frozen RVM at 100%. DINOv3 + GMMix surpasses frozen RVM’s full-data performance using less than 25% of the training data.
[…] we compare our best model against established video foundation models (Tab. 3). All backbones are frozen; only a…
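
A data-efficiency curve like the one in Figure 4 needs training subsets at several fractions of the full set. The sketch below shows one common way to draw them reproducibly. It is an assumption, not the authors' procedure: this page does not say how the SSv2 subsets were sampled, and the function name `subsample`, the clip identifiers, and the fractions used are hypothetical.

```python
import random


def subsample(items, fraction, seed=0):
    """Return a reproducible subset holding `fraction` of `items`.

    One shuffled order is cut to a prefix, so with a fixed seed every
    smaller subset is contained in every larger one.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(items) * fraction))
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    # Sort the kept indices so the subset preserves the original ordering.
    return [items[i] for i in sorted(order[:k])]


# Usage with hypothetical clip identifiers and fractions.
video_ids = [f"clip_{i:04d}" for i in range(1000)]
subsets = {f: subsample(video_ids, f) for f in (0.1, 0.25, 0.5, 1.0)}
```

Nesting the subsets is a design choice: it means each point on the curve adds data to the previous one rather than drawing an unrelated sample, which reduces variance between neighbouring points.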
Original abstract

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript explores a data-efficient paradigm for video understanding: a pre-trained image foundation model is frozen as a spatial encoder while only a recurrent temporal module is trained on streaming video. The central claim, based on empirical findings across multiple video tasks, is that strong temporal performance can emerge without large-scale video pre-training, thereby reducing data and compute costs compared to end-to-end video foundation models.

Significance. If the reported findings hold under scrutiny, the work offers a practical route to lower the substantial costs of video pre-training by reusing mature image representations. The public code release at https://github.com/tue-mps/towards-video-image-frozen is a clear strength that supports reproducibility and further investigation of recurrent video models.

major comments (2)
  1. [Abstract] Abstract: the claim that 'strong temporal performance can emerge without large-scale video pre-training' is presented without any quantitative results, baselines, ablation details, or dataset sizes. This absence makes it impossible to evaluate whether the empirical findings actually support the feasibility conclusion.
  2. [Approach and Experiments] Approach and Experiments sections: the load-bearing assumption that frozen image-foundation features remain sufficiently rich and transferable for motion-heavy tasks (deformation, occlusion, viewpoint change) is not tested with targeted ablations. If the recurrent module cannot compensate, the method reduces to a lightweight baseline rather than a viable alternative to video pre-training.
minor comments (2)
  1. The title refers to 'Video Pre-training' yet the method trains only a temporal module on top of a frozen encoder; a brief clarification of terminology would avoid reader confusion.
  2. Figure and table captions should explicitly state the video datasets and task metrics used so that the empirical claims can be assessed at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the presentation of our results and assumptions.

  1. Referee: [Abstract] Abstract: the claim that 'strong temporal performance can emerge without large-scale video pre-training' is presented without any quantitative results, baselines, ablation details, or dataset sizes. This absence makes it impossible to evaluate whether the empirical findings actually support the feasibility conclusion.

    Authors: We agree that the abstract, as currently written, is high-level and does not include quantitative highlights. While the full manuscript provides detailed results, baselines, ablations, and dataset sizes in the Experiments section, we will revise the abstract to incorporate concise quantitative statements (e.g., relative performance on standard benchmarks and data/compute savings) to better support the feasibility claim for readers. revision: yes

  2. Referee: [Approach and Experiments] Approach and Experiments sections: the load-bearing assumption that frozen image-foundation features remain sufficiently rich and transferable for motion-heavy tasks (deformation, occlusion, viewpoint change) is not tested with targeted ablations. If the recurrent module cannot compensate, the method reduces to a lightweight baseline rather than a viable alternative to video pre-training.

    Authors: We acknowledge that our current experiments, while covering video tasks that involve motion, deformation, occlusion, and viewpoint variation, do not include dedicated ablations that isolate these factors. We will add targeted ablations in the revised manuscript that systematically vary these conditions to demonstrate the contribution of the recurrent temporal module and confirm that frozen image features remain effective when paired with it. revision: yes

Circularity Check

0 steps flagged

Empirical feasibility study with no derivation chain or circular reductions

The paper is framed as an empirical exploration of reusing frozen image foundation models plus a simple recurrent temporal module for video tasks, without large-scale video pre-training. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-referential definitions by construction. Central claims rest on reported task performances across video understanding benchmarks, which are externally verifiable through experiments rather than internally forced. This is a self-contained empirical study against external benchmarks with no load-bearing self-citations or ansatz smuggling identified in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that image-model spatial features transfer directly to video when a recurrent temporal module is added; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: Image foundation models provide powerful spatial representations that are transferable to video tasks when combined with a recurrent temporal module.
    This premise is invoked in the abstract as the justification for freezing the image model and training only the temporal component.

pith-pipeline@v0.9.0 · 5726 in / 1092 out tokens · 26845 ms · 2026-05-20T10:25:16.333402+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 9 internal anchors

  1. [1]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

  2. [2]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.

  3. [3]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.

  4. [4]

    João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross G...

  5. [5]

    Learning phrase representations using RNN encoder-decoder for statistical machine translation

    Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

  6. [6]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.

  7. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  8. [8]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842...

  9. [9]

    Kubric: A scalable dataset generator

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3761, 2022.

  10. [10]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

  11. [11]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.

  12. [12]

    Multi-task learning using uncertainty to weigh losses for scene geometry and semantics

    Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491,

  13. [13]

    Videomamba: State space model for efficient video understanding

    Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In European Conference on Computer Vision, pages 237–255. Springer, 2024.

  14. [14]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  15. [15]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ...

  16. [16]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023.

  17. [17]

    Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, and Razvan Pascanu. Trecvit: A recurrent video transformer. arXiv preprint arXiv:2412.14294,

  18. [18]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  19. [19]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Coupri...

  20. [20]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.

  21. [21]

    VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022.

  22. [22]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357, 2021.

  23. [23]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  24. [24]

    On the continuity of rotation representations in neural networks

    Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753,

  25. [25]

    Recurrent Video Masked Autoencoders

    Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A. Hudson, João Carreira, and Andrew Zisserman. Recurrent video masked autoencoders. arXiv preprint arXiv:2512.13684,

  26. [26]

    Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models: Supplementary Material

    The offline evaluation protocol, including readout architectures and training procedures, follows [4, 25]. In this supplementary material, we explain the streaming evaluation, which reflects the real-world scenario of receiving video f...

  27. [27]

    Streaming Tasks

    Table 4 provides an overview of each task, including the training dataset, loss function, evaluation metric, and readout head parameters. The readout heads are based on the cross-attention architecture from [25], adapted to operate on single-frame tokens ($N$ tokens per frame) instead of the full spatio-temporal sequence ($T \times N$ tokens). The en...

  28. [28]

    Training Settings

    All tasks share the same training configuration unless stated otherwise. We use AdamW [14] with $(\beta_1, \beta_2) = (0.9, 0.999)$, weight decay $10^{-4}$, a cosine learning rate schedule decaying to $\eta_{\min} = 10^{-7}$, and linear warmup. Training uses mixed precision (bf16). Our protocol is based on 4DS [4], which used 40K steps for frozen training (only the ...

  29. [29]

    Evaluation Metrics

    Top-1 accuracy (SSv2). Standard classification accuracy on the validation set.
    mIoU (Waymo). Mean Intersection over Union between predicted and ground-truth bounding boxes, averaged over all objects and frames.
    Average Jaccard (PT). Following the Perception Test benchmark [16], AJ is defined as the average of Jaccard values at position...
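
The learning rate schedule described under Training Settings (entry 28), linear warmup followed by cosine decay to a floor of 10^-7, can be sketched in plain Python as below. Only the floor value comes from the text. The peak learning rate, the warmup length, and the use of 40K total steps (which the excerpt attributes to 4DS, not necessarily to this paper) are assumptions chosen for illustration, and `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_steps, eta_min=1e-7):
    """Learning rate at `step`: linear warmup to base_lr, then cosine decay to eta_min."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


# Usage with assumed values: peak 1e-4, 1,000 warmup steps, 40,000 total steps.
schedule = [lr_at(s, 40_000, 1e-4, 1_000) for s in range(40_001)]
```

In this sketch the rate rises for the first 1,000 steps, peaks at the assumed `base_lr`, and then falls monotonically to `eta_min` at the final step.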
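
The mIoU definition under Evaluation Metrics (entry 29) can be made concrete with a short sketch. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and a pooled average over every (object, frame) pair. The excerpt does not specify the box format or whether averaging is pooled or per-object first, so both are assumptions, as are the function names.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred, gt):
    """Average IoU over all (object, frame) keys present in the ground truth."""
    keys = sorted(gt)
    return sum(box_iou(pred[k], gt[k]) for k in keys) / len(keys)


# Usage: one object tracked over two frames (made-up boxes).
gt = {("car", 0): (0, 0, 2, 2), ("car", 1): (1, 1, 3, 3)}
pred = {("car", 0): (0, 0, 2, 2), ("car", 1): (2, 1, 4, 3)}
score = mean_iou(pred, gt)  # frame 0 IoU is 1, frame 1 IoU is 1/3, mean is 2/3
```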