arxiv: 2606.03837 · v1 · pith:CRMBX2IGnew · submitted 2026-06-02 · 💻 cs.CV

Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?

Pith reviewed 2026-06-28 10:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords parameter-efficient fine-tuning · video adaptation · temporal context · low-resource learning · PEFT · probing · video understanding · foundation models

The pith

Temporal context should be distributed across backbone, PEFT and probe instead of confined to one component for low-resource video adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how to adapt foundation models to video tasks when annotations and compute are limited. It compares image-pretrained backbones against video representations and tests where temporal reasoning is placed: in the backbone, the parameter-efficient adapter, or the task probe. Experiments cover appearance-focused, motion-focused and spatially dense tasks under low-data regimes. The results indicate that allocation choices affect adaptation success and that confining temporal processing to a single part restricts performance.

Core claim

Systematic comparison of PEFT and probing strategies shows that temporal context allocation across the backbone, the adapter modules and the probe determines adaptation effectiveness in low-resource video settings, with the best placement depending on whether tasks emphasize appearance, motion or spatial density.

What carries the argument

Distribution of temporal reasoning across backbone, PEFT adapter and task probe during model adaptation.
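
One way to picture this allocation is to give each of the three components its own temporal context, meaning the number of consecutive frames whose tokens it may mix. The following is a minimal sketch under stated assumptions: the `TemporalAllocation` record, the `frame_windows` helper and the non-overlapping window scheme are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalAllocation:
    """Hypothetical record: frames each component may mix information across."""
    backbone: int
    peft: int
    probe: int


def frame_windows(num_frames: int, context: int) -> list[list[int]]:
    """Split frame indices into consecutive, non-overlapping windows of `context` frames.

    context == 1 means frame-wise processing; context == num_frames means the
    component sees the whole clip at once. (Windowing is an assumed scheme.)
    """
    if context < 1 or num_frames % context != 0:
        raise ValueError("context must be a positive divisor of num_frames")
    return [list(range(s, s + context)) for s in range(0, num_frames, context)]


def describe(alloc: TemporalAllocation, num_frames: int) -> dict[str, list[list[int]]]:
    """Frame windows seen by each component for a clip of `num_frames` frames."""
    return {
        "backbone": frame_windows(num_frames, alloc.backbone),
        "peft": frame_windows(num_frames, alloc.peft),
        "probe": frame_windows(num_frames, alloc.probe),
    }


# Frame-wise backbone, PEFT over 4-frame windows, probe over the full 8-frame clip.
windows = describe(TemporalAllocation(backbone=1, peft=4, probe=8), num_frames=8)
```

In this picture, confining temporal reasoning to a single component corresponds to setting two of the three fields to 1.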

If this is right

  • Confining temporal reasoning to only one model component reduces adaptation performance across the tested task categories.
  • Different task types benefit from temporal context placed in different locations among backbone, PEFT and probe.
  • Direct comparison of image-pretrained and video-pretrained representations under the same PEFT regimes reveals setting-specific differences.
  • In data-scarce conditions, selective temporal allocation improves parameter-efficient adaptation more than uniform placement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of future video foundation models could expose separate temporal modules at each stage to allow task-specific allocation.
  • The allocation principle may extend to other temporal modalities such as audio or time-series sensor data.
  • Similar studies on larger-scale models or additional low-resource regimes could test whether the observed patterns persist.

Load-bearing premise

The chosen appearance-focused, motion-focused and spatially dense tasks together with the tested low-resource regimes are representative enough for the allocation conclusions to generalize.

What would settle it

Re-running the adaptation experiments on a new video dataset or task type outside the evaluated set and finding that the previously optimal temporal allocations no longer produce the best results.

Figures

Figures reproduced from arXiv: 2606.03837 by Hazel Doughty, Luc P.J. Sträter.

Figure 1: Distributing Temporal Context. Existing works place temporal modeling in a single component: either the PEFT modules (a), where a frame-wise backbone is combined with temporal adaptation, or in the backbone (b), where temporally-aware representations are adapted by static PEFT and probes. We instead allow temporal modeling across backbone, PEFT and probe simultaneously (c) and study how temporal context sh…
Figure 2: Comparison of Probe Methods for Video. In isolation, the best probe varies per task. Combining probe and PEFT reduces this variability, with attentive probe performing best overall. … effective across task types. More broadly, we find conclusions from other domains do not necessarily transfer to video; for example, DoRA does not outperform LoRA, despite being more recent. Conclusion. Low-rank and adapter met…
Figure 3: Disentangling Temporal Modeling from Input Frames. Performance-throughput trade-offs when varying temporal context jointly with the input (matched) or only within the backbone, PEFT, and probe while keeping the input fixed at 16 frames (input-fixed). This separates performance changes due to temporal-modeling and input-information. 5.4 Do Input Frames or Temporal Modeling Matter More? Experimental Setup. H…
Figure 4: How to Distribute Temporal Context. Fixing one component while varying the other two at Tc isolates each component’s contribution to temporal modeling. Curves above input-fixed indicate that the corresponding component improves with increased temporal context. … the reference regardless of Tc, while reducing it drops performance sharply. When Tbackbone<T, allocating more temporal-context to the PEFT modules …
Figure 5: PEFT vs. Full-Finetuning. PEFT methods outperform the attentive probe on every task with fewer trainable parameters, and exceed full finetuning on 5 of 6 tasks. Notably, general PEFT methods (LoRA, AdaptFormer) perform on par with or better than the video-specific ST-Adapter. All methods adapt an InternVideo-Next ViT-B backbone with 2048 tokens per video.
Figure 6: Disentangling temporal modeling from input information. Decoupling the input frame count from the internal temporal context of the backbone, PEFT, and probe separates performance drops into a temporal-modeling component and an input-information component. This is the per-dataset version of …
Figure 7: Where to allocate temporal context. Fixing one component while varying the other two at Tc isolates each component’s contribution to temporal modeling. Curves above input-fixed indicate that the corresponding component improves with increased temporal context. This is the per-dataset version of …
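
The experimental conditions named in the captions for Figures 3 and 4 can be sketched as configuration builders. This is an illustrative sketch only: the dictionary layout and function names are hypothetical, and the choice to hold the fixed component at the full clip length in `fix_one` is an assumption, since the captions do not state the value it is fixed at.

```python
FULL_CLIP = 16  # input length of the input-fixed setting, per the Figure 3 caption


def matched(tc: int) -> dict[str, int]:
    """Matched: the input and all three components share temporal context `tc`."""
    return {"input": tc, "backbone": tc, "peft": tc, "probe": tc}


def input_fixed(tc: int, full: int = FULL_CLIP) -> dict[str, int]:
    """Input-fixed: the input keeps `full` frames; only the components use `tc`."""
    return {"input": full, "backbone": tc, "peft": tc, "probe": tc}


def fix_one(component: str, tc: int, full: int = FULL_CLIP) -> dict[str, int]:
    """Hold one component fixed (assumed: at the full clip) while the other two use `tc`."""
    if component not in ("backbone", "peft", "probe"):
        raise ValueError(f"unknown component: {component}")
    cfg = input_fixed(tc, full)
    cfg[component] = full
    return cfg


# Hypothetical sweep values; the captions do not list the contexts evaluated.
sweep = [fix_one("peft", tc) for tc in (1, 2, 4, 8)]
```

Comparing `matched(tc)` against `input_fixed(tc)` separates the effect of seeing fewer frames from the effect of modeling less temporal context, which is the distinction the Figure 3 caption draws.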
Original abstract

Parameter-efficient fine-tuning (PEFT) and probing enable adaptation of foundation models using only a small number of trainable parameters, making it attractive for video understanding where annotation and computation are expensive. However, video PEFT has focused on adapting image-pretrained models, while standard PEFT methods can also be applied to video representations. These settings are rarely compared and both confine temporal reasoning to a single component of the model, leaving open how temporal context should be distributed across backbone, PEFT and probe. In this work we provide a systematic study of model adaptation strategies for video understanding. We evaluate methods across appearance-focused, motion-focused and spatially dense settings, with a particular focus on scenarios with limited data where parameter-efficiency is most beneficial. Our results provide new insights into PEFT and probing across settings and demonstrate the importance of temporal context allocation for effective video adaptation.
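
To make "a small number of trainable parameters" concrete, the arithmetic below counts the weights a LoRA update would train on a ViT-B-sized model. It is a rough sketch: the hidden size 768 and depth 12 are the standard ViT-B values, while the rank of 8 and the choice to adapt only the query and value projections are assumptions made for illustration, not settings reported on this page.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights of a LoRA update B @ A, with A: rank x d_in and B: d_out x rank."""
    return rank * (d_in + d_out)


def dense_params(d_in: int, d_out: int) -> int:
    """Weights of the full projection matrix (bias ignored)."""
    return d_in * d_out


WIDTH = 768   # ViT-B hidden size
LAYERS = 12   # ViT-B depth
RANK = 8      # assumed LoRA rank
ADAPTED = 2   # assumed: query and value projections in each layer

lora_total = LAYERS * ADAPTED * lora_params(WIDTH, WIDTH, RANK)
dense_total = LAYERS * ADAPTED * dense_params(WIDTH, WIDTH)
ratio = lora_total / dense_total  # fraction of the adapted projections' weights
```

Under these assumptions the low-rank update trains 294,912 weights against 14,155,776 in the projections it adapts, a ratio of 1/48.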

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents a systematic empirical study of parameter-efficient fine-tuning (PEFT) and probing strategies for adapting foundation models to low-resource video understanding tasks. It compares image-pretrained and video-pretrained models, evaluates adaptation across appearance-focused, motion-focused, and spatially dense settings, and examines how temporal context should be allocated among the backbone, PEFT modules, and probe.

Significance. If the empirical results hold under broader validation, the work supplies actionable guidance on temporal context allocation for efficient video adaptation in data-scarce regimes, an area where annotation and compute costs are high. It bridges PEFT literature from images to video and highlights setting-dependent differences in adaptation effectiveness.

major comments (1)
  1. [Evaluation and Results sections] The central claim that the study demonstrates the 'importance of temporal context allocation' for effective video adaptation rests on the representativeness of the selected appearance-focused, motion-focused, and spatially dense tasks plus the specific low-resource regimes. The manuscript provides no explicit diversity arguments, controls for motion-versus-appearance emphasis, or cross-dataset validation to show these choices generalize beyond the evaluated models and datasets; this directly affects whether the allocation insights are setting-specific or broadly applicable.
minor comments (1)
  1. [Abstract] The abstract states that the study supplies 'new insights' and 'demonstrate[s] the importance' of temporal allocation but contains no quantitative results, error bars, dataset sizes, or exclusion criteria, which delays assessment of claim strength until the full text is read.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our empirical study. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Evaluation and Results sections] The central claim that the study demonstrates the 'importance of temporal context allocation' for effective video adaptation rests on the representativeness of the selected appearance-focused, motion-focused, and spatially dense tasks plus the specific low-resource regimes. The manuscript provides no explicit diversity arguments, controls for motion-versus-appearance emphasis, or cross-dataset validation to show these choices generalize beyond the evaluated models and datasets; this directly affects whether the allocation insights are setting-specific or broadly applicable.

    Authors: We acknowledge that the manuscript does not include explicit cross-dataset validation, additional controls isolating motion-versus-appearance emphasis, or formal diversity arguments for task selection. The evaluated tasks were chosen as standard benchmarks representative of the three categories (appearance-focused, motion-focused, and dense), following categorizations common in prior video understanding literature. We will revise the Evaluation section to add a dedicated paragraph providing these justifications, citing relevant prior works on task categorization, and explicitly framing the temporal allocation insights as setting-dependent rather than claiming broad generalizability. This addresses the concern through added discussion without new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation study

Full rationale

This is an empirical study that evaluates PEFT, probing, and temporal context allocation across appearance-focused, motion-focused, and spatially dense video tasks in low-resource regimes. No equations, derivations, or self-referential reductions are present; claims rest on experimental comparisons rather than any fitted parameter renamed as prediction or self-citation chain that collapses to the input. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical ablation study; no mathematical derivations or new entities introduced. Relies on standard machine-learning evaluation assumptions.

axioms (1)
  • Domain assumption: standard i.i.d. train/val/test splits and task definitions in video understanding benchmarks are sufficient to draw allocation conclusions.
    Typical for CV empirical papers; invoked implicitly when claiming general insights from chosen tasks.

pith-pipeline@v0.9.1-grok · 5677 in / 1117 out tokens · 18007 ms · 2026-06-28T10:28:56.907821+00:00 · methodology


Reference graph

Works this paper leans on

79 extracted references · 11 canonical work pages · 6 internal anchors

  1. [1]

    Vivit: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lu ˇci´c, and Cordelia Schmid. Vivit: A video vision transformer. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, 2021

  2. [2]

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew J. Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, X...

  3. [3]

    Revisiting feature prediction for learning visual representations from video.Transactions on Machine Learning Research, 2024

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video.Transactions on Machine Learning Research, 2024

  4. [4]

    Strong baselines for parameter- efficient few-shot fine-tuning

    Samyadeep Basu, Shell Hu, Daniela Massiceti, and Soheil Feizi. Strong baselines for parameter- efficient few-shot fine-tuning. InAAAI Conference on Artificial Intelligence (AAAI), pages 11024–11031, 2024

  5. [5]

    Peft-bench: A parameter- efficient fine-tuning methods benchmark

    Robert Belanec, Branislav Pecher, Ivan Srba, and Maria Bielikova. Peft-bench: A parameter- efficient fine-tuning methods benchmark. InConference of the European Chapter of the Association for Computational Linguistics (EACL), pages 3035–3054, 2026

  6. [6]

    Is space-time attention all you need for video understanding? InInternational Conference on Machine Learning (ICML), pages 813–824

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InInternational Conference on Machine Learning (ICML), pages 813–824. PMLR, 2021

  7. [7]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Koulako Bala Doumbouya, Esin Du...

  8. [8]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021. 10

  9. [9]

    Hudson, Pedro V’elez, Luisa F

    João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahen- dran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro V’elez, Luisa F. Polan’ia, Luke Friedman, Chris Duvarney, Ross Gorosh...

  10. [10]

    Vl-jepa: Joint embedding predictive architecture for vision-language

    Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, and Pascale Fung. Vl-jepa: Joint embedding predictive architecture for vision-language. InInternational Conference on Learning Representations (ICLR), 2026

  11. [11]

    Adaptformer: Adapting vision transformers for scalable visual recognition

    Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. InAdvances in Neural Information Processing Systems (NeurIPS), pages 16664–16678, 2022

  12. [12]

    Video depth anything: Consistent depth estimation for super-long videos

    Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22831–22840, 2025

  13. [13]

    Context autoencoder for self-supervised representation learning.International Journal of Computer Vision (IJCV), 132(1):208–223, 2024

    Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning.International Journal of Computer Vision (IJCV), 132(1):208–223, 2024

  14. [14]

    Randaugment: Practical automated data augmentation with a reduced search space

    Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 702–703, 2020

  15. [15]

    Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  16. [16]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009

  17. [17]

    Parameter- efficient fine-tuning of large-scale pre-trained language models.Nature Machine Intelligence, 5 (3):220–235, 2023

    Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, Jing Yi, Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen, Yang Liu, Jie Tang, Juanzi Li, and Maosong Sun. Parameter- efficient fine-tuning of large-scale pre-trained language models.Nature Machine Intelligence, 5 ...

  18. [18]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

  19. [19]

    Scalable pre-training of large autoregressive image models

    Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Vaishaal Shankar, Alexander Toshev, Joshua M Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models. InInternational Conference on Machine Learning (ICML), pages 12371–12384. PMLR, 2024

  20. [20]

    Mammalps: A multi-view video behavior monitoring dataset of wild mammals in the swiss alps

    Valentin Gabeff, Haozhe Qi, Brendan Flaherty, Gencer Sumbul, Alexander Mathis, and Devis Tuia. Mammalps: A multi-view video behavior monitoring dataset of wild mammals in the swiss alps. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13854–13864, 2025

  21. [21]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The" something something" video database for learning and evaluating visual common sense. InIEEE International Confer...

  22. [22]

    Sun, Skanda Koppula, Dilara Gokay, Joseph Heyward, Etienne Pot, and Andrew Zisserman

    Yana Hasson, Pauline Luc, Liliane Momeni, Maks Ovsjanikov, Guillaume Le Moing, Alina Kuznetsova, Ira Ktena, Jennifer J. Sun, Skanda Koppula, Dilara Gokay, Joseph Heyward, Etienne Pot, and Andrew Zisserman. Scivid: Cross-domain evaluation of video models in scientific applications. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 21800–...

  23. [23]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational Conference on Machine Learning (ICML), pages 2790–2799. PMLR, 2019

  24. [24]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

  25. [25]

    Nurvid: A large expert-level video database for nursing procedure activity understanding

    Ming Hu, Lin Wang, Siyuan Yan, Don Ma, Qingli Ren, Peng Xia, Wei Feng, Peibo Duan, Lie Ju, and Zongyuan Ge. Nurvid: A large expert-level video database for nursing procedure activity understanding. InAdvances in Neural Information Processing Systems (NeurIPS), pages 18146–18164, 2023

  26. [26]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. InEuropean Conference on Computer Vision (ECCV), pages 709–727. Springer, 2022

  27. [27]

    Fact: Factor-tuning for lightweight adaptation on vision transformer

    Shibo Jie and Zhi-Hong Deng. Fact: Factor-tuning for lightweight adaptation on vision transformer. InAAAI Conference on Artificial Intelligence (AAAI), pages 1060–1068, 2023

  28. [28]

    Convolutional bypasses are better vision transformer adapters

    Shibo Jie, Zhi-Hong Deng, Shixuan Chen, and Zhijuan Jin. Convolutional bypasses are better vision transformer adapters. InEuropean Conference on Artificial Intelligence (ECAI). IOS Press, 2024

  29. [29]

    The Kinetics Human Action Video Dataset

    Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya- narasimhan, Fabio Viola, Tim Green, Trevor Back, Apostol Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017

  30. [30]

    Do better imagenet models transfer better? InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2661–2671, 2019

    Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2661–2671, 2019

  31. [31]

    Hmdb: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. InIEEE International Conference on Computer Vision (ICCV), pages 2556–2563, 2011

  32. [32]

    Context-aware emotion recognition networks

    Jiyoung Lee, Seungryong Kim, Sunok Kim, Jungin Park, and Kwanghoon Sohn. Context-aware emotion recognition networks. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 10143–10152, 2019

  33. [33]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 3045–3059, 2021

  34. [34]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. InAnnual Meeting of the Association for Computational Linguistics (ACL), pages 4582–4597, 2021

  35. [35]

    Videoeval: Compre- hensive benchmark suite for low-cost evaluation of video foundation model.arXiv preprint arXiv:2407.06491, 2024

    Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, and Limin Wang. Videoeval: Compre- hensive benchmark suite for low-cost evaluation of video foundation model.arXiv preprint arXiv:2407.06491, 2024

  36. [36]

    Scaling & shifting your features: A new baseline for efficient model tuning

    Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. InAdvances in Neural Information Processing Systems (NeurIPS), pages 109–123, 2022. 12

  37. [37]

    Frozen clip models are efficient video learners

    Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard De Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. InEuropean Conference on Computer Vision (ECCV), pages 388–404. Springer, 2022

  38. [38]

    Dora: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In International Conference on Machine Learning (ICML). PMLR, 2024

  39. [39]

    From static to dynamic: Exploring self-supervised image-to-video representation transfer learning

    Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao, and Qingming Huang. From static to dynamic: Exploring self-supervised image-to-video representation transfer learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  40. [40]

    Towards efficient visual adaption via structural re-parameterization.arXiv preprint arXiv:2302.08106, 2023

    Gen Luo, Minglang Huang, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Zhiyu Wang, and Rongrong Ji. Towards efficient visual adaption via structural re-parameterization.arXiv preprint arXiv:2302.08106, 2023

  41. [41]

    Lessons and insights from a unifying study of parameter-efficient fine- tuning (peft) in visual recognition

    Zheda Mai, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Quang-Huy Nguyen, Li Zhang, and Wei-Lun Chao. Lessons and insights from a unifying study of parameter-efficient fine- tuning (peft) in visual recognition. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14845–14857, 2025

  42. [42]

    Vspw: A large-scale dataset for video scene parsing in the wild

    Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. Vspw: A large-scale dataset for video scene parsing in the wild. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4133–4143, 2021

  43. [43]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

  44. [44]

    St-adapter: Parameter- efficient image-to-video transfer learning

    Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter- efficient image-to-video transfer learning. InAdvances in Neural Information Processing Systems (NeurIPS), pages 26462–26477, 2022

  45. [45]

    Dual-path adaptation from image to video transformers

    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn. Dual-path adaptation from image to video transformers. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2203–2213, 2023

  46. [46]

    Adapterfusion: Non-destructive task composition for transfer learning

    Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. InConference of the European Chapter of the Association for Computational Linguistics (EACL), pages 487–503, 2021

  47. [47]

    Frame2freq: Spectral adapters for fine-grained video understanding.IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

    Thinesh Thiyakesan Ponbagavathi, Constantin Seibold, and Alina Roitberg. Frame2freq: Spectral adapters for fine-grained video understanding.IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  48. [48]

    Attention, please! revisiting attentive probing through the lens of efficiency

    Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, and Giorgos Tolias. Attention, please! revisiting attentive probing through the lens of efficiency. InInternational Conference on Learning Representations (ICLR), 2026

  49. [49]

    Disentangling spatial and temporal learning for efficient image-to-video transfer learning

    Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, and Nong Sang. Disentangling spatial and temporal learning for efficient image-to-video transfer learning. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 13934–13944, 2023

  50. [50]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInterna- tional Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021. 13

  51. [51]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, 2021

  52. [52]

    Imagenet-21k pretraining for the masses

    Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

  53. [53]

    Schoonbeek, Tim Houben, Hans Onvlee, Peter H.N

    Tim J. Schoonbeek, Tim Houben, Hans Onvlee, Peter H.N. de With, and Fons van der Sommen. Industreal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting. InIEEE/CVF Winter Conference on Applications of Com- puter Vision (WACV), pages 4365–4374, 2024

  54. [54]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models....

  55. [55]

    Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seung Eun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel HAZIZA, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Juli...

  56. [56]

    How severe is benchmark-sensitivity in video self-supervised learning? InEuropean Conference on Computer Vision (ECCV), pages 632–652

    Fida Mohammad Thoker, Hazel Doughty, Piyush Bagad, and Cees GM Snoek. How severe is benchmark-sensitivity in video self-supervised learning? InEuropean Conference on Computer Vision (ECCV), pages 632–652. Springer, 2022

  57. [57]

    Severe++: Evaluating benchmark sensitivity in generalization of video representation learning

    Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Piyush Bagad, Hazel Doughty, Bernard Ghanem, and Cees GM Snoek. Severe++: Evaluating benchmark sensitivity in generalization of video representation learning. arXiv preprint arXiv:2504.05706, 2025

  58. [58]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  59. [59]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features...

  60. [60]

    Canonical rank adaptation: An efficient fine-tuning strategy for vision transformers

    Lokesh Veeramacheneni, Moritz Wolter, Hilde Kuehne, and Juergen Gall. Canonical rank adaptation: An efficient fine-tuning strategy for vision transformers. In International Conference on Machine Learning (ICML). PMLR, 2025

  61. [61]

    Internvideo-next: Towards world-understanding video models

    Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan, Yali Wang, Yi Wang, and Limin Wang. Internvideo-next: Towards world-understanding video models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16966–16976, 2026

  62. [62]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560, 2023

  63. [63]

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022

  64. [64]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Jilan Xu, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision (ECCV), pages 396–416...

  65. [65]

    Attention to the burstiness in visual prompt tuning!

    Yuzhu Wang, Manni Duan, and Shu Kong. Attention to the burstiness in visual prompt tuning! In IEEE/CVF International Conference on Computer Vision (ICCV), pages 4253–4263, 2025

  66. [66]

    Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning

    Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 4230–4239, 2023

  67. [67]

    V-petl bench: A unified visual parameter-efficient transfer learning benchmark

    Yi Xin, Siqi Luo, Xuyang Liu, Yuntao Du, Haodi Zhou, Xinyu Cheng, Christina Luoluo Lee, Junlong Du, Haozhe Wang, MingCai Chen, Ting Liu, Guimin Hu, Zhongwei Wan, Rongchao Zhang, Aoxue Li, Mingyang Yi, and Xiaohong Liu. V-petl bench: A unified visual parameter-efficient transfer learning benchmark. In Advances in Neural Information Processing Systems (Neu...

  68. [68]

    Parameter-efficient fine-tuning for pre-trained vision models: A survey and benchmark

    Yi Xin, Jianjiang Yang, Siqi Luo, Yuntao Du, Qi Qin, Kangrui Cen, Yangfan He, Zhiwei Zhang, Bin Fu, Xiaokang Yang, Guangtao Zhai, Ming-Hsuan Yang, and Xiaohong Liu. Parameter-efficient fine-tuning for pre-trained vision models: A survey and benchmark. arXiv preprint arXiv:2402.02242, 2024

  69. [69]

    Aim: Adapting image models for efficient video action recognition

    Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video action recognition. In International Conference on Learning Representations (ICLR), 2023

  70. [70]

    VideoGLUE: Video general understanding evaluation of foundation models

    Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, and Boqing Gong. VideoGLUE: Video general understanding evaluation of foundation models. Transactions on Machine Learning Resea...

  71. [71]

    Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022

  72. [72]

    Visual fourier prompt tuning

    Runjia Zeng, Cheng Han, Qifan Wang, Chunshu Wu, Tong Geng, Lifu Huang, Ying N Wu, and Dongfang Liu. Visual fourier prompt tuning. In Advances in Neural Information Processing Systems (NeurIPS), pages 5552–5585, 2024

  73. [73]

    A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

    Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, André Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, and Neil Houlsby. A large-scale study of representation learning with the visual task ad...

  74. [74]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018

  75. [75]

    Adalora: Adaptive budget allocation for parameter-efficient fine-tuning

    Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations (ICLR), 2023

  76. [76]

    Low-resource vision challenges for foundation models

    Yunhua Zhang, Hazel Doughty, and Cees GM Snoek. Low-resource vision challenges for foundation models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21956–21966, 2024

  77. [77]

    Dynamic tuning towards parameter and inference efficiency for vit adaptation

    Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, and Yang You. Dynamic tuning towards parameter and inference efficiency for vit adaptation. In Advances in Neural Information Processing Systems (NeurIPS), pages 114765–114796, 2024

  78. [78]

    Random erasing data augmentation

    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI Conference on Artificial Intelligence (AAAI), pages 13001–13008, 2020

  79. [79]

    Recurrent video masked autoencoders

    Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, João Carreira, and Andrew Zisserman. Recurrent video masked autoencoders. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17744–17755, 2026

A Experiment Details

A.1 Backbone

All experiments use an InternVideo-Next [61] ViT-B backbone with patch size 14 and input res...