Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition

Dongbo Min; Jiwon Yoon; Jiyeong Kim; Yerim So

arxiv: 2605.23288 · v1 · pith:MJSGKBLGnew · submitted 2026-05-22 · 💻 cs.CV

Yerim So, Jiyeong Kim, Jiwon Yoon, Dongbo Min

Pith reviewed 2026-05-25 05:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary action recognition · spatio-temporal similarity volume · CLIP transfer · video action recognition · similarity aggregation · Mamba temporal modeling · zero-shot learning · patch-level alignment

The pith

SimVA builds a 4D similarity volume to transfer CLIP to video action recognition while preserving local patch details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing open-vocabulary action recognition methods first collapse visual features into a global representation and only then compute alignment with text, which erases fine-grained patch information and temporal cues. The paper proposes Similarity Volume Aggregation (SimVA), which instead constructs a dense 4D spatio-temporal similarity volume directly from patch-level visual-text similarities between video tokens and action classes. Class sampling keeps the process scalable, after which spatial aggregation improves intra-frame consistency, motion-aware modulation highlights changing regions, and Mamba-based steps model how similarity patterns evolve over time. By keeping the dense correspondence intact throughout, the method transfers CLIP to video tasks. A reader would care because the approach targets the exact information loss that limits zero-shot and few-shot performance on action benchmarks.

Core claim

The paper claims that constructing a dense 4D spatio-temporal similarity volume over local video tokens and action classes, then refining it via class sampling, spatial aggregation, motion-aware modulation, and Mamba temporal aggregation, preserves local information and enables effective transfer of CLIP to open-vocabulary video action recognition, delivering competitive results on zero-shot, few-shot, and base-to-novel benchmarks.

What carries the argument

The 4D spatio-temporal similarity volume, which stores patch-level visual-text similarities and is refined by successive aggregation modules to maintain dense correspondence.
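To make the object concrete, here is a minimal NumPy sketch of how such a volume could be assembled. It is an illustration, not the authors' implementation: the axis layout `(T, H, W, C)`, the use of cosine similarity, the feature width of 512, and the function names `l2_normalize` and `build_similarity_volume` are all assumptions, and random arrays stand in for CLIP patch and text features. The last two lines show the global-pooling alternative the paper argues against, which yields one score per class and no spatial or temporal axes.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_similarity_volume(patch_feats, text_feats):
    """Cosine similarity between every video patch and every class prompt.

    patch_feats: (T, H, W, D) patch features for T frames (assumed layout).
    text_feats:  (C, D) one text embedding per action class.
    Returns a (T, H, W, C) similarity volume.
    """
    v = l2_normalize(patch_feats)
    t = l2_normalize(text_feats)
    return np.einsum("thwd,cd->thwc", v, t)

# Random stand-ins for CLIP features (hypothetical sizes).
rng = np.random.default_rng(0)
patch_feats = rng.standard_normal((8, 14, 14, 512))
text_feats = rng.standard_normal((101, 512))

volume = build_similarity_volume(patch_feats, text_feats)
print(volume.shape)  # (8, 14, 14, 101)

# Global-pooling baseline for contrast: features are collapsed first.
pooled = l2_normalize(patch_feats.mean(axis=(0, 1, 2)))
global_scores = l2_normalize(text_feats) @ pooled
print(global_scores.shape)  # (101,)
```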

If this is right

Maintains dense visual-text correspondence at every stage instead of collapsing early.
Achieves competitive zero-shot performance on standard action recognition benchmarks.
Generalizes in few-shot and base-to-novel settings through the same volume construction.
Scales to large vocabularies by sampling classes before aggregation.
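The class-sampling step in the last point can be sketched as a top-M selection over the class axis. The ranking criterion used here, mean similarity over all tokens of the video, is an assumption made for illustration, as are the function name `sample_classes` and the `(T, H, W, C)` layout; the abstract does not specify how classes are scored.

```python
import numpy as np

def sample_classes(volume, m):
    """Keep the m classes with the highest mean similarity over all tokens.

    volume: (T, H, W, C) similarity volume (assumed layout).
    Returns the reduced (T, H, W, m) volume and the kept class indices.
    """
    m = min(m, volume.shape[-1])
    class_scores = volume.mean(axis=(0, 1, 2))   # one score per class, shape (C,)
    kept = np.argsort(class_scores)[::-1][:m]    # indices in descending score order
    return volume[..., kept], kept

# Toy volume: 50 classes of low-level noise, with class 3 clearly dominant.
rng = np.random.default_rng(0)
volume = rng.uniform(-0.1, 0.1, size=(4, 7, 7, 50))
volume[..., 3] += 1.0

reduced, kept = sample_classes(volume, m=5)
print(reduced.shape)  # (4, 7, 7, 5)
print(kept[0])        # 3
```

Later stages then cost time proportional to M rather than to the full vocabulary size, which is the scalability argument; the price is that any class outside `kept` is invisible to them.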

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same volume construction could be tested on other dense prediction video tasks such as temporal action localization.
Motion-aware modulation may prove especially useful in datasets with rapid camera or actor movement.
If the volume remains tractable, it suggests a route to adapt frozen image-text models to video without additional backbone training.

Load-bearing premise

That successive aggregation of the dense similarity volume will preserve local information and scale to large vocabularies without introducing artifacts or accuracy loss.

What would settle it

A benchmark result in which SimVA falls below global-feature baselines on a large-vocabulary zero-shot task or when actions are distinguished only by fine local patch motion.

Figures

Figures reproduced from arXiv: 2605.23288 by Dongbo Min, Jiwon Yoon, Jiyeong Kim, Yerim So.

**Figure 1.** Conceptual comparison of Open-Vocabulary Action Recognition paradigms. (a) Prior methods [33, 24, 15, 23] aggregate visual features into a global representation before computing text alignment. (b) Our method computes patch-level video-text similarities over action classes and organizes them into a spatio-temporal similarity volume. By aggregating this volume in the similarity space, we preserve dense loca…

**Figure 2.** Overall architecture of the SimVA framework. Given an input video and action text prompts, we extract features using CLIP [27]. We first construct a dense 4D spatio-temporal similarity volume (Sec. 2.1) and subsequently select the top-M relevant classes via action class sampling (Sec. 2.2) for efficiency. This volume is then processed through a structured aggregation architecture: the spatial aggregation m…

**Figure 3.** Visualization of motion-aware modulation. Blue arrows visualize the motion offsets r_t estimated between adjacent frames. Each arrow is anchored at a spatial patch location; its direction indicates the estimated local displacement direction, and its length represents the offset magnitude. The mean-subtracted offsets suppress global motion trends and highlight local inter-frame variations.

**Figure 4.** Qualitative visualizations of similarity volume aggregation on the HMDB-51 dataset. For each example, columns from left to right show the input frame, similarity score S (in (1)), and aggregated similarity volume. The aggregated volume refines noisy patch-level similarities into more spatially coherent action-related responses across frames.
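The mean subtraction described in the Figure 3 caption can be illustrated with a short sketch. How the offsets are estimated is not stated in the material available here, so the example takes them as given; the array layout `(T-1, H, W, 2)` and the function name `mean_subtract_offsets` are assumptions. The toy input combines a uniform camera pan with a single patch that also moves on its own.

```python
import numpy as np

def mean_subtract_offsets(offsets):
    """Remove the per-frame average displacement from every patch offset.

    offsets: (T-1, H, W, 2) displacement (dx, dy) of each patch between
             adjacent frames (assumed layout).
    """
    global_motion = offsets.mean(axis=(1, 2), keepdims=True)  # (T-1, 1, 1, 2)
    return offsets - global_motion

# Uniform pan of 2 units in x, plus one patch that also moves 3 units in y.
offsets = np.tile(np.array([2.0, 0.0]), (3, 4, 4, 1))
offsets[:, 1, 1] += np.array([0.0, 3.0])

local = mean_subtract_offsets(offsets)
magnitude = np.linalg.norm(local, axis=-1)   # (T-1, H, W)
print(magnitude[0].round(4))
```

After subtraction the pan disappears from every patch, and the independently moving patch at position (1, 1) carries by far the largest residual magnitude.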

Original abstract

Recent Open-Vocabulary Action Recognition (OVAR) methods typically aggregate visual features into a global representation before computing text alignment, a process that obscures local patch information and fine-grained spatio-temporal cues. We propose Similarity Volume Aggregation (SimVA), a framework that constructs a dense 4D spatio-temporal similarity volume from patch-level visual-text similarities. SimVA constructs a spatio-temporal similarity volume over local video tokens and action classes, and employs class sampling to ensure similarity aggregation scalable to large vocabularies. The similarity volume is refined by spatial aggregation, which contextualizes local similarity patterns to improve intra-frame consistency. Motion-aware modulation further injects inter-frame variation cues, highlighting dynamically changing regions. Mamba-based temporal aggregation then models the evolution of class-conditioned similarity patterns across frames. By maintaining dense visual-text correspondence, SimVA effectively transfers CLIP to video action recognition, achieving competitive performance across zero-shot, few-shot, and base-to-novel benchmarks.
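The ordering of the refinement stages can be sketched with deliberately simple stand-ins. Nothing below reproduces the paper's learned modules: a 3×3 box filter substitutes for spatial aggregation, and a causal exponential moving average (the simplest linear state-space recurrence) substitutes for Mamba. The function names and the `(T, H, W, C)` layout are assumptions. The point of the sketch is only that every stage maps a volume to a volume of the same shape, and pooling to class scores happens last.

```python
import numpy as np

def spatial_box_aggregate(volume, radius=1):
    """Average each (frame, class) map over a (2r+1) x (2r+1) window.

    Stand-in for learned spatial aggregation; edges are replicated.
    """
    _, h, w, _ = volume.shape
    pad = ((0, 0), (radius, radius), (radius, radius), (0, 0))
    padded = np.pad(volume, pad, mode="edge")
    size = 2 * radius + 1
    out = np.zeros_like(volume)
    for dy in range(size):
        for dx in range(size):
            out += padded[:, dy:dy + h, dx:dx + w, :]
    return out / (size * size)

def temporal_ema(volume, decay=0.5):
    """Causal recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t, h_0 = x_0.

    Stand-in for Mamba-based temporal aggregation.
    """
    out = np.empty_like(volume)
    state = volume[0]
    out[0] = state
    for step in range(1, volume.shape[0]):
        state = decay * state + (1.0 - decay) * volume[step]
        out[step] = state
    return out

def classify(volume):
    """Pool the refined volume to per-class scores only at the very end."""
    return volume.mean(axis=(0, 1, 2))

# Toy volume: a single patch responds to class 0 from frame 1 onward.
volume = np.zeros((3, 5, 5, 2))
volume[1:, 2, 2, 0] = 1.0

refined = temporal_ema(spatial_box_aggregate(volume), decay=0.5)
scores = classify(refined)
print(refined.shape)    # (3, 5, 5, 2)
print(scores.argmax())  # 0
```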

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a 4D similarity-volume pipeline with class sampling and staged aggregation to keep local cues when transferring CLIP to open-vocab video, but supplies zero results or analysis so the claims cannot be checked.

The core contribution is a constructive pipeline that builds a dense 4D spatio-temporal similarity volume over video patches and classes, applies class sampling for scale, and then runs spatial aggregation, motion-aware modulation, and Mamba temporal steps. This targets the common problem of early global pooling, which erases fine-grained cues in prior CLIP-to-video transfers. Judging from the description given, the exact combination of a 4D volume with that sequence of aggregations looks new. The approach is presented as a straightforward procedure with no obvious circular definitions or fitted parameters, which is a plus. It correctly identifies the value of preserving patch-level visual-text alignments throughout.

The stress-test concern is that class sampling may discard critical similarities for large vocabularies, and the abstract does nothing to answer it: no bound, derivation, or ablation shows that the sampled subset retains the important argmax or top-k matches, and later stages only see the reduced volume. That undercuts the dense-correspondence premise for realistic open-vocab settings. The abstract also states competitive performance on zero-shot, few-shot, and base-to-novel benchmarks yet provides no numbers, tables, or implementation details, so nothing can be verified.

This work is aimed at CV researchers already working on open-vocabulary video or similarity-based transfer methods; someone exploring Mamba on structured volumes might extract a design idea, but the lack of evidence limits broader value. I would not bring it to a reading group, would not cite it, and would not send it to peer review until experiments and sampling justification are added.

Referee Report

1 major / 1 minor

Summary. The paper proposes Similarity Volume Aggregation (SimVA) for open-vocabulary action recognition. It constructs a dense 4D spatio-temporal similarity volume over local video tokens and action classes from patch-level CLIP similarities, applies class sampling for scalability to large vocabularies, then refines the volume via spatial aggregation (for intra-frame consistency), motion-aware modulation (for inter-frame cues), and Mamba-based temporal aggregation. The central claim is that this pipeline maintains dense visual-text correspondence, enabling effective CLIP transfer to video and competitive results on zero-shot, few-shot, and base-to-novel benchmarks.

Significance. If the sampling and aggregation steps can be shown to preserve the necessary local alignments, the method would address a common limitation in prior OVAR work (early global aggregation that discards patch-level cues) and provide a constructive, scalable route for dense correspondence in video tasks.

major comments (1)

[Abstract] Abstract (and method description): class sampling is introduced explicitly 'to ensure similarity aggregation scalable to large vocabularies,' yet no derivation, bound, or analysis is supplied showing that the sampled subset retains the argmax or top-k patch-class similarities obtainable from the full vocabulary. Subsequent spatial aggregation, motion-aware modulation, and Mamba steps operate only on the reduced volume; any discarded high-similarity class cannot be recovered. This directly threatens the load-bearing premise that dense visual-text correspondence is maintained for realistic open-vocabulary settings (|C| ≫ 100).

minor comments (1)

[Abstract] Abstract supplies no quantitative results, ablation studies, error analysis, or implementation specifics, so the performance claim cannot be checked against the described pipeline.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying this gap in the justification of class sampling. The concern is substantive and directly relevant to the scalability claim. We respond point-by-point below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract (and method description): class sampling is introduced explicitly 'to ensure similarity aggregation scalable to large vocabularies,' yet no derivation, bound, or analysis is supplied showing that the sampled subset retains the argmax or top-k patch-class similarities obtainable from the full vocabulary. Subsequent spatial aggregation, motion-aware modulation, and Mamba steps operate only on the reduced volume; any discarded high-similarity class cannot be recovered. This directly threatens the load-bearing premise that dense visual-text correspondence is maintained for realistic open-vocabulary settings (|C| ≫ 100).

Authors: We agree that the current manuscript provides no derivation, probabilistic bound, or empirical analysis demonstrating that the sampled class subset preserves the argmax or top-k patch-class similarities from the full vocabulary. This is a genuine limitation of the submitted version. In the revised manuscript we will (1) explicitly describe the sampling procedure (per-video selection of the M classes with the highest mean token similarity), (2) add a new subsection with both a theoretical discussion (under a mild assumption on similarity score concentration) and empirical measurements of the top-k retention rate on the evaluation vocabularies, and (3) report an ablation that measures the performance drop when sampling is replaced by the full vocabulary on the largest benchmark vocabularies used. These additions will be placed in the method and experimental sections and will directly address whether dense correspondence is preserved after sampling.

revision: yes
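The retention measurement discussed in this exchange could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' protocol: the helper names `top_m_by_mean` and `patch_argmax_retention` are hypothetical, classes are ranked by mean token similarity, and retention is defined as the fraction of tokens whose best-matching class under the full vocabulary survives sampling.

```python
import numpy as np

def top_m_by_mean(volume, m):
    """Indices of the m classes with the highest mean similarity over tokens."""
    class_scores = volume.mean(axis=(0, 1, 2))
    return np.argsort(class_scores)[::-1][:m]

def patch_argmax_retention(volume, kept):
    """Fraction of tokens whose full-vocabulary argmax class is in `kept`."""
    best = volume.argmax(axis=-1)               # (T, H, W) best class per token
    return float(np.isin(best, kept).mean())

# Toy case: class 0 is moderately similar everywhere, while class 2 is the
# strongest match at exactly one of the four patches.
volume = np.zeros((1, 2, 2, 3))
volume[..., 0] = 0.5
volume[0, 0, 0, 2] = 1.0

kept = top_m_by_mean(volume, m=1)
retention = patch_argmax_retention(volume, kept)
print(kept, retention)  # [0] 0.75
```

In this toy case mean-based ranking keeps class 0 and drops class 2, so one token in four loses its best match. That is the failure mode the referee describes: a class that matters only locally can be averaged out before any aggregation stage sees it.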

Circularity Check

0 steps flagged

No circularity: constructive pipeline with no self-referential reductions

The paper describes a sequence of explicit construction steps—building a 4D similarity volume from patch-level CLIP similarities, applying class sampling for scalability, then spatial aggregation, motion-aware modulation, and Mamba temporal aggregation—without any equation or claim that reduces a derived quantity to a fitted parameter or prior self-citation by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no prediction is statistically forced by an input fit. The central claim of maintaining dense correspondence is presented as a direct consequence of the described operations rather than an input assumed by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, preventing identification of free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

[1] M. Bain, A. Nagrani, G. Varol, and A. Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021

[2] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600, 2018. arXiv preprint arXiv:1808.01340

[3] T. Chen, H. Yu, Z. Yang, Z. Li, W. Sun, and C. Chen. Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recognition. In CVPR, pages 18888–18898, 2024

[4] W. Chen, H. Xu, Z. Zhou, Y. Liu, B. Sun, W. Kang, and X. Xie. Costformer: Cost transformer for cost aggregation in multi-view stereo, 2023. arXiv preprint arXiv:2305.10320

[5] S. Cho, S. Hong, S. Jeon, Y. Lee, K. Sohn, and S.-W. Kim. Cats: Cost aggregation transformers for visual correspondence. In NeurIPS, pages 9011–9023, 2021

[6] S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S.-W. Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In CVPR, pages 4113–4123, 2024

[7] G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin. Scaling open-vocabulary image segmentation with image-level labels. In ECCV, pages 540–557, 2022

[8] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. S. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The "something something" video database for learning and evaluating visual common sense. In ICCV, pages 5842–5850, 2017

[9] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In COLM, 2024

[10] S. Hong, S. Cho, J. Nam, S. Lin, and S.-W. Kim. Cost aggregation with 4d convolutional swin transformer for few-shot segmentation. In ECCV, pages 108–126, 2022

[11] X. Huang, H. Zhou, K. Yao, and K. Han. Froster: Frozen clip is a strong teacher for open-vocabulary action recognition, 2024. arXiv preprint arXiv:2402.03241

[12] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Wu, Z. Chen, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021

[13] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie. Prompting visual-language models for efficient video understanding. In ECCV, pages 105–124, 2022

[14] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset, 2017. arXiv preprint arXiv:1705.06950

[15] M. Kim, D. Han, T. Kim, and B. Han. Leveraging temporal contextualization for video action recognition. In ECCV, pages 74–91, 2024

[16] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recognition. In ICCV, pages 2556–2563, 2011

[17] J. Li, D. Li, C. Xiong, and S. C. H. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900, 2022

[18] K. Li, X. Li, Y. Wang, Y. He, Y. Wang, L. Wang, and Y. Qiao. Videomamba: State space model for efficient video understanding. In ECCV, pages 237–255, 2024

[19] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, and J. Gao. Grounded language-image pre-training. In CVPR, pages 10965–10975, 2022

[20] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, and D. Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In CVPR, pages 7061–7070, 2023

[21] W. Lin, L. Karlinsky, N. Shvetsova, H. Possegger, M. Kozinski, R. Panda, and H. Bischof. Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge. In ICCV, pages 2851–2862, 2023

[22] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021

[23] F. Long, X. Li, J. Lv, H. Yang, X. Cheng, and P. Li. Bdc-clip: Brownian distance covariance for adapting clip to action recognition. In ICML, 2025

[24] Y. Ma, G. Xu, X. Sun, M. Yan, J. Zhang, and R. Ji. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In ACM MM, pages 638–647, 2022

[25] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, pages 2630–2640, 2019

[26] J. Pan, Z. Lin, X. Zhu, J. Shao, and H. Li. St-adapter: Parameter-efficient image-to-video transfer learning. In NeurIPS, pages 26462–26477, 2022

[27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021

[28] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, pages 18082–18091, 2022

[29] H. Rasheed, M. U. Khattak, M. Maaz, S. Khan, and F. S. Khan. Fine-tuned clip models are efficient video learners. In CVPR, pages 6545–6554, 2023

[30] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012. arXiv preprint arXiv:1212.0402

[31] Z. Teed and J. Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, pages 402–419, 2020

[32] M. Tschannen, A. Gritsenko, X. Zhai, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, and L. Beyer. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025. arXiv preprint arXiv:2502.14786

[33] M. Wang, J. Xing, and Y. Liu. Actionclip: A new paradigm for video action recognition, 2021. arXiv preprint arXiv:2109.08472

[34] M. Wasim, S. Khan, F. S. Khan, and M. Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In CVPR, pages 19606–19616, 2023

[35] Z. Weng, X. Yang, A. Li, Z. Wu, and Y.-G. Jiang. Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization. In ICML, pages 36978–36989, 2023

[36] H. Xu, G. Ghosh, P.-Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, and C. Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. In EMNLP, pages 6787–6800, 2021

[37] T. Yang, Y. Zhu, Y. Xie, A. Zhang, C. Chen, and M. Li. Aim: Adapting image models for efficient video action recognition, 2023. arXiv preprint arXiv:2302.03024

[38] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu. Filip: Fine-grained interactive language-image pre-training. In ICLR, 2022

[39] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023

[40] W. Zhang, C. Wan, T. Liu, X. Tian, X. Shen, and J. Ye. Enhanced motion-text alignment for image-to-video transfer learning. In CVPR, pages 18504–18515, 2024

[41] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, and J. Gao. Regionclip: Region-based language-image pretraining. In CVPR, pages 16793–16803, 2022

[42] X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra. Detecting twenty-thousand classes using image-level supervision. In ECCV, pages 350–368, 2022
