Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

Chaolei Han; Hongsong Wang; Jidong Kuang; Jie Gui; Lei Zhang

arxiv: 2501.13795 · v1 · pith:OOUWVM2Cnew · submitted 2025-01-23 · 💻 cs.CV

Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

Chaolei Han , Hongsong Wang , Jidong Kuang , Lei Zhang , Jie Gui This is my paper

Pith reviewed 2026-05-23 04:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords zero-shot temporal action detectionvision-language modelstraining-free methodLogOIC scoretest-time adaptationtemporal action localization

0 comments

The pith

Pre-trained vision-language models can directly detect unseen actions in untrimmed videos without any training or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a training-free zero-shot temporal action detection method called FreeZAD that uses existing vision-language models to classify and localize unseen activities in videos. It introduces the LogOIC score and frequency-based calibration to handle proposals without needing explicit temporal modeling or high-quality pseudo-labels. This approach avoids the domain shifts and high computational costs of training-based methods. Experiments show it outperforms state-of-the-art unsupervised methods on THUMOS14 and ActivityNet-1.3 while using only 1/13 of the runtime. Adding test-time adaptation with prototype-centric sampling further closes the gap to fully supervised methods.

Core claim

Existing vision-language models can be leveraged directly for zero-shot temporal action detection without additional fine-tuning by using a logarithmic decay weighted outer-inner contrastive score to evaluate action proposals and frequency-based actionness calibration, leading to better performance than unsupervised methods at a fraction of the runtime.

What carries the argument

The Logarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration that enable direct use of ViL models for classifying and localizing unseen actions.

If this is right

The method requires no training, reducing runtime to 1/13 of previous unsupervised approaches.
It outperforms state-of-the-art unsupervised zero-shot temporal action detection methods on standard datasets.
Equipping it with test-time adaptation using prototype-centric sampling narrows the performance gap with fully supervised methods.
The approach mitigates issues from domain shifts and dependence on pseudo-label quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar calibration techniques could be tested on other video understanding tasks that use pre-trained models.
This suggests that vision-language models encode sufficient temporal structure for action localization even without explicit training on video data.

Load-bearing premise

Pre-trained vision-language models already contain enough knowledge to classify and localize unseen actions when the right scoring and calibration functions are applied to video proposals.

What would settle it

Running the method on THUMOS14 and finding that its mean average precision is lower than current unsupervised methods would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2501.13795 by Chaolei Han, Hongsong Wang, Jidong Kuang, Jie Gui, Lei Zhang.

**Figure 1.** Figure 1: We propose the first training-free method for ZSTAD, distinguishing it from all existing state-of-the-art methods, which are training-based with varying degrees of supervision. 1. Introduction With the development of social media and surveillance systems, video understanding has become more important. As a fundamental task in video understanding, temporal action detection (TAD) [38, 43] aims to recognize… view at source ↗

**Figure 2.** Figure 2: Overall architecture of our proposed training-free zero-shot temporal action detection network. The model recognizes and localizes unseen activities within untrimmed videos with only a single forward pass. Specifically, video-level label is first generated through visual embedding and textual encoding derived from the visual and textual backbones of ViL models. Then, segments are selected by calculating th… view at source ↗

**Figure 3.** Figure 3: An illustration of Logarithmic Decay Weighted Outer-Inner-Contrastive Score (LogOIC). The green mask and the red mask respectively cover the inner and outer activations within a segment. The adjusted outer activation, shown in orange, are obtained by reweighting the blue activation with a logarithmic decay weight. with its textual feature. Building on this consideration, the cosine similarity is directly c… view at source ↗

**Figure 4.** Figure 4: Illustration of zero-shot temporal action detection with test-time adaptation (AdaZAD). This paradigm seeks to adapt the pre-trained ViL models for TAD without annotations, and the detected results are obtained after completing all adaptation steps. Prototype-Centric Sampling (PCS) focuses on selecting positive samples surrounding the highest similarity values to the textual features. segment-level calibr… view at source ↗

**Figure 5.** Figure 5: The left plot shows the average mAP across various [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Error analysis of our FreeZAD. The left is false positive profile and the right is removing error impact, where G denotes the number of ground truths. Only the Top-1G to Top-3G predictions are provided due to the sparsity of the predictions. The enhanced model with a designed test-time adaptation strategy further outperforms the performance and narrows the gap with fully supervised methods. Extensive ab… view at source ↗

read the original abstract

Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A training-free ZSTAD pipeline using ViL models with LogOIC and PCS claims to beat unsupervised baselines at 1/13 runtime on THUMOS14 and ActivityNet.

read the letter

The main point is that this paper gives a training-free way to do zero-shot temporal action detection by feeding untrimmed videos straight into pre-trained vision-language models and post-processing their outputs with a logarithmic outer-inner contrast score plus frequency calibration. It also adds a test-time adaptation step via prototype-centric sampling that narrows the gap to supervised methods. The reported result is better numbers than prior unsupervised work at roughly one-thirteenth the runtime on the two standard datasets. That runtime saving is the concrete practical angle here. The components are clearly named and motivated by the problems of domain shift and pseudo-label noise in earlier training-based ZSTAD work, so the design choices are easy to follow. The paper does a straightforward job showing why skipping training altogether helps deployment. The experiments appear to rest on direct comparisons to external baselines rather than self-referential fitting. Soft spots are mostly around verification: the abstract and method description do not spell out exact metric definitions or baseline re-implementations, so it is hard to judge how much the gains depend on the particular ViL backbone or video preprocessing. The zero-shot claim could also be sensitive to overlap between the ViL pre-training data and the test actions, though the paper does not appear to hide that issue. Generalization beyond the two datasets is not heavily tested. This is aimed at people working on efficient video models or zero-shot CV pipelines who care about runtime. A reader who wants to try the scoring functions or the PCS sampling would get usable ideas from it. The work shows clear engagement with the literature on the practical barriers, so it deserves a serious referee to check the experimental details and robustness. I would send it to review rather than desk reject.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes FreeZAD, a training-free zero-shot temporal action detection method that leverages pre-trained vision-language models to classify and localize unseen actions in untrimmed videos. It introduces the Logarithmic decay weighted Outer-Inner-Contrastive (LogOIC) score and frequency-based Actionness Calibration to avoid explicit temporal modeling and pseudo-labels, plus a test-time adaptation (TTA) strategy with Prototype-Centric Sampling (PCS). Experiments on THUMOS14 and ActivityNet-1.3 claim outperformance over state-of-the-art unsupervised ZSTAD methods at 1/13 the runtime, with TTA further narrowing the gap to fully supervised approaches.

Significance. If the empirical claims hold under full verification, the work would be significant for demonstrating that fixed pre-trained ViL models plus lightweight post-processing can deliver competitive ZSTAD performance without any training or adaptation, substantially lowering computational barriers compared to prior unsupervised methods.

major comments (3)

[Experiments] Experiments section: performance claims of outperformance on THUMOS14 and ActivityNet-1.3 together with the 1/13 runtime reduction are stated without accompanying tables of per-class mAP, exact baseline implementations, metric definitions (e.g., IoU thresholds), or measured wall-clock times, preventing independent verification of the central empirical result.
[§3] §3 (Method), LogOIC definition: the claim that the logarithmic decay weighting mitigates domain shift and reliance on pseudo-label quality is not supported by an explicit equation or derivation showing how the outer-inner contrastive term is computed from ViL logits; without this, it is impossible to assess whether the score is parameter-free or reproducible.
[TTA subsection] TTA subsection: the Prototype-Centric Sampling strategy is presented as reliably improving performance, yet no ablation isolating PCS from the base FreeZAD pipeline or analysis of sampling bias on unseen classes is provided, leaving the weakest assumption (direct ViL applicability without fine-tuning) untested.

minor comments (1)

[§3.3] Notation for Actionness Calibration is introduced without a clear link to the frequency-based formula; a short equation would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that enhance verifiability and clarity without altering the core claims.

read point-by-point responses

Referee: [Experiments] Experiments section: performance claims of outperformance on THUMOS14 and ActivityNet-1.3 together with the 1/13 runtime reduction are stated without accompanying tables of per-class mAP, exact baseline implementations, metric definitions (e.g., IoU thresholds), or measured wall-clock times, preventing independent verification of the central empirical result.

Authors: We agree that the current presentation lacks sufficient detail for independent verification. In the revised manuscript we will add tables reporting per-class mAP, explicitly list the IoU thresholds used (standard 0.1–0.5 for THUMOS14 and 0.5–0.95 for ActivityNet), document the precise baseline implementations and code references, and include measured wall-clock times on the same hardware used for all methods. These additions will directly support the reported outperformance and runtime claims. revision: yes
Referee: [§3] §3 (Method), LogOIC definition: the claim that the logarithmic decay weighting mitigates domain shift and reliance on pseudo-label quality is not supported by an explicit equation or derivation showing how the outer-inner contrastive term is computed from ViL logits; without this, it is impossible to assess whether the score is parameter-free or reproducible.

Authors: Section 3 already presents LogOIC as a parameter-free score computed from ViL logits, but we acknowledge the derivation could be more explicit. We will insert a dedicated equation block and step-by-step derivation in the revised §3 that shows how the outer-inner contrastive term is obtained from the logits and how the logarithmic decay is applied, thereby confirming reproducibility and the mechanism for reducing domain-shift sensitivity. revision: yes
Referee: [TTA subsection] TTA subsection: the Prototype-Centric Sampling strategy is presented as reliably improving performance, yet no ablation isolating PCS from the base FreeZAD pipeline or analysis of sampling bias on unseen classes is provided, leaving the weakest assumption (direct ViL applicability without fine-tuning) untested.

Authors: We will add a new ablation subsection that isolates the effect of Prototype-Centric Sampling (PCS) on the base FreeZAD pipeline and includes quantitative analysis of sampling bias across unseen classes. This will provide direct evidence for the contribution of the TTA component. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method (FreeZAD) that applies pre-trained vision-language models to zero-shot temporal action detection via post-processing scores (LogOIC, frequency calibration) and optional TTA (PCS). All performance claims are evaluated against external prior methods on THUMOS14 and ActivityNet-1.3; no derivation chain reduces a claimed result to a fitted parameter or self-citation by construction. The central pipeline is a fixed, non-learned post-processor whose outputs are compared to independent baselines rather than being tautological with its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No details on free parameters, axioms, or invented entities are provided in the abstract; assessment is limited to abstract content only.

pith-pipeline@v0.9.0 · 5756 in / 1005 out tokens · 24776 ms · 2026-05-23T04:48:11.452100+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions
cs.CV 2026-05 unverdicted novelty 5.0

Higher temporal resolution in video significantly improves zero-shot semantic understanding of high-speed human actions like kendo.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,

work page
[2]

Diagnosing error in temporal action detectors

Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, and Bernard Ghanem. Diagnosing error in temporal action detectors. In Proceedings of the European conference on computer vision (ECCV), pages 256–272, 2018. 8

work page 2018
[3]

Opental: Towards open set temporal action localization

Wentao Bao, Qi Yu, and Yu Kong. Opental: Towards open set temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2979–2989, 2022. 1

work page 2022
[4]

Activitynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015. 6

work page 2015
[5]

Video mamba suite: State space model as a ver- satile alternative for video understanding

Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, and Limin Wang. Video mamba suite: State space model as a ver- satile alternative for video understanding. arXiv preprint arXiv:2403.09626, 2024. 2

work page arXiv 2024
[6]

Cascade evidential learning for open-world weakly-supervised tem- poral action localization

Mengyuan Chen, Junyu Gao, and Changsheng Xu. Cascade evidential learning for open-world weakly-supervised tem- poral action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 14741–14750, 2023. 1

work page 2023
[7]

An algorithm for the machine calculation of complex fourier series

James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297–301, 1965. 5

work page 1965
[8]

Vqgan-clip: Open domain image generation and editing with natural language guidance

Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Ed- ward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. In European Con- ference on Computer Vision, pages 88–105. Springer, 2022. 3

work page 2022
[9]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 18135–18143, 2024. 3

work page 2024
[10]

in the wild

Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding , 155:1– 23, 2017. 6

work page 2017
[11]

Test-time classifier adjustment module for model-agnostic domain generaliza- tion

Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generaliza- tion. Advances in Neural Information Processing Systems , 34:2427–2440, 2021. 3

work page 2021
[12]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR,

work page
[13]

Prompting visual-language models for efficient video understanding

Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In European Conference on Computer Vi- sion, pages 105–124. Springer, 2022. 2, 6, 7

work page 2022
[14]

Distill- ing vision-language pre-training to collaborate with weakly- supervised temporal action localization

Chen Ju, Kunhao Zheng, Jinxiang Liu, Peisen Zhao, Ya Zhang, Jianlong Chang, Qi Tian, and Yanfeng Wang. Distill- ing vision-language pre-training to collaborate with weakly- supervised temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14751–14762, 2023. 3

work page 2023
[15]

Te-tad: Towards full end-to-end temporal action detection via time-aligned coordinate expression

Ho-Joong Kim, Jung-Ho Hong, Heejo Kong, and Seong- Whan Lee. Te-tad: Towards full end-to-end temporal action detection via time-aligned coordinate expression. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18837–18846, 2024. 2

work page 2024
[16]

Self-feedback detr for temporal action detection

Jihwan Kim, Miso Lee, and Jae-Pil Heo. Self-feedback detr for temporal action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 10286–10296, 2023. 2

work page 2023
[17]

Detal: Open-vocabulary temporal action localization with decoupled networks

Zhiheng Li, Yujie Zhong, Ran Song, Tianjiao Li, Lin Ma, and Wei Zhang. Detal: Open-vocabulary temporal action localization with decoupled networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 3, 6, 7

work page 2024
[18]

A comprehensive sur- vey on test-time adaptation under distribution shifts

Jian Liang, Ran He, and Tieniu Tan. A comprehensive sur- vey on test-time adaptation under distribution shifts. Inter- national Journal of Computer Vision, pages 1–34, 2024. 2, 3

work page 2024
[19]

Test-time zero-shot temporal action localization

Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, and Elisa Ricci. Test-time zero-shot temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18720– 18729, 2024. 2, 3, 5, 6, 7

work page 2024
[20]

Single shot tempo- ral action detection

Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot tempo- ral action detection. In Proceedings of the 25th ACM inter- national conference on Multimedia , pages 988–996, 2017. 2

work page 2017
[21]

Video test-time adaptation for action recognition

Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, and Horst Bischof. Video test-time adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 22952–22961, 2023. 3

work page 2023
[22]

End-to-end temporal action detection with 1b parameters across 1000 frames

Shuming Liu, Chen-Lin Zhang, Chen Zhao, and Bernard Ghanem. End-to-end temporal action detection with 1b parameters across 1000 frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18591–18601, 2024. 2

work page 2024
[23]

Depth-aware test-time training for zero-shot video object segmentation

Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi- Man Pun, and Xiaodong Cun. Depth-aware test-time training for zero-shot video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19218–19227, 2024. 3

work page 2024
[24]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 2

work page internal anchor Pith review Pith/arXiv arXiv 2013
[25]

Zero-shot temporal action detection via vision-language prompting

Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, and Tao Xi- ang. Zero-shot temporal action detection via vision-language prompting. In European Conference on Computer Vision , pages 681–697. Springer, 2022. 2, 3, 6, 7

work page 2022
[26]

Clipping: Distilling clip-based models with a student base for video- language retrieval

Renjing Pei, Jianzhuang Liu, Weimian Li, Bin Shao, Song- cen Xu, Peng Dai, Juwei Lu, and Youliang Yan. Clipping: Distilling clip-based models with a student base for video- language retrieval. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 18983–18992, 2023. 3

work page 2023
[27]

Temporal context aggregation network for temporal action proposal refinement

Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, and Nong Sang. Temporal context aggregation network for temporal action proposal refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 485–494, 2021. 2

work page 2021
[28]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International Conference on Machine Learning , pages 8748–8763. PMLR, 2021. 2, 3

work page 2021
[29]

Denseclip: Language-guided dense prediction with context- aware prompting

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 18082–18091, 2022. 3

work page 2022
[30]

Action sensitivity learning for temporal action localization

Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, and Yi Yang. Action sensitivity learning for temporal action localization. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 13457–13469,

work page
[31]

Tridet: Temporal action detection with relative boundary modeling

Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao. Tridet: Temporal action detection with relative boundary modeling. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 18857–18866, 2023. 2

work page 2023
[32]

Autoloc: Weakly-supervised temporal action localization in untrimmed videos

Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision , pages 154– 171, 2018. 4

work page 2018
[33]

Re- laxed transformer decoders for direct action proposal gener- ation

Jing Tan, Jiaqi Tang, Limin Wang, and Gangshan Wu. Re- laxed transformer decoders for direct action proposal gener- ation. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 13526–13535, 2021. 2

work page 2021
[34]

Clip-nerf: Text-and-image driven manip- ulation of neural radiance fields

Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Clip-nerf: Text-and-image driven manip- ulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3835–3844, 2022. 3

work page 2022
[35]

Bilateral adaptation for human-object interac- tion detection with occlusion-robustness

Guangzhi Wang, Yangyang Guo, Ziwei Xu, and Mohan Kankanhalli. Bilateral adaptation for human-object interac- tion detection with occlusion-robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27970–27980, 2024. 3

work page 2024
[36]

Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International conference on machine learn- ing, pages 23318–23340. PMLR, 2022. 3

work page 2022
[37]

Image as a foreign language: Beit pretraining for vision and vision- language tasks

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhil- iang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mo- hammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision- language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175– 19186, 2023. 3

work page 2023
[38]

Two-stream networks for weakly-supervised temporal action localiza- tion with semantic-aware mechanisms

Yu Wang, Yadong Li, and Hongbin Wang. Two-stream networks for weakly-supervised temporal action localiza- tion with semantic-aware mechanisms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18878–18887, 2023. 1

work page 2023
[39]

Vita-clip: Video and text adaptive clip via multimodal prompting

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fa- had Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23034–23044, 2023. 3

work page 2023
[40]

Learning to refactor action and co-occurrence fea- tures for temporal action localization

Kun Xia, Le Wang, Sanping Zhou, Nanning Zheng, and Wei Tang. Learning to refactor action and co-occurrence fea- tures for temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13884–13893, 2022. 2

work page 2022
[41]

Learning in the frequency domain

Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. Learning in the frequency domain. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1740–1749, 2020. 5

work page 2020
[42]

Channel attention for sensor-based activity recognition: embedding features into all frequencies in dct domain

Shige Xu, Lei Zhang, Yin Tang, Chaolei Han, Hao Wu, and Aiguo Song. Channel attention for sensor-based activity recognition: embedding features into all frequencies in dct domain. IEEE Transactions on Knowledge and Data Engi- neering, 35(12):12497–12512, 2023. 5

work page 2023
[43]

Basictad: an astounding rgb-only baseline for tem- poral action detection

Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, and Limin Wang. Basictad: an astounding rgb-only baseline for tem- poral action detection. Computer Vision and Image Under- standing, 232:103692, 2023. 1

work page 2023
[44]

Coca: Contrastive captioners are image-text foundation models

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. 3, 6

work page 2022
[45]

Actionformer: Lo- calizing moments of actions with transformers

Chen-Lin Zhang, Jianxin Wu, and Yin Li. Actionformer: Lo- calizing moments of actions with transformers. In European Conference on Computer Vision , pages 492–510. Springer,

work page
[46]

Hr-pro: Point-supervised temporal action localization via hierarchical reliability prop- agation

Huaxin Zhang, Xiang Wang, Xiaohao Xu, Zhiwu Qing, Changxin Gao, and Nong Sang. Hr-pro: Point-supervised temporal action localization via hierarchical reliability prop- agation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7115–7123, 2024. 6

work page 2024
[47]

Do- mainadaptor: A novel approach to test-time adaptation

Jian Zhang, Lei Qi, Yinghuan Shi, and Yang Gao. Do- mainadaptor: A novel approach to test-time adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18971–18981, 2023. 3

work page 2023
[48]

Vision-language models for vision tasks: A survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence,

work page
[49]

Zstad: Zero-shot temporal activity detection

Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Sen Wang, Zongyuan Ge, and Alexander Hauptmann. Zstad: Zero-shot temporal activity detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 879–888, 2020. 2

work page 2020
[50]

Tn-zstad: Transfer- able network for zero-shot temporal activity detection

Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Zhi- hui Li, Lina Yao, and Alex Hauptmann. Tn-zstad: Transfer- able network for zero-shot temporal activity detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3848–3861, 2022. 2

work page 2022
[51]

Video self- stitching graph network for temporal action localization

Chen Zhao, Ali K Thabet, and Bernard Ghanem. Video self- stitching graph network for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13658–13667, 2021. 2

work page 2021
[52]

Movement enhancement toward multi-scale video feature representa- tion for temporal action detection

Zixuan Zhao, Dongqi Wang, and Xu Zhao. Movement enhancement toward multi-scale video feature representa- tion for temporal action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 13555–13564, 2023. 2

work page 2023
[53]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Com- puter Vision, pages 696–712. Springer, 2022. 3

work page 2022
[54]

Learning from observer gaze: Zero-shot attention prediction oriented by human-object interaction recognition

Yuchen Zhou, Linkai Liu, and Chao Gou. Learning from observer gaze: Zero-shot attention prediction oriented by human-object interaction recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28390–28400, 2024. 3

work page 2024
[55]

Dual detrs for multi-label temporal action de- tection

Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, and Limin Wang. Dual detrs for multi-label temporal action de- tection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 18559– 18569, 2024. 2

work page 2024

[1] [1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,

work page

[2] [2]

Diagnosing error in temporal action detectors

Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, and Bernard Ghanem. Diagnosing error in temporal action detectors. In Proceedings of the European conference on computer vision (ECCV), pages 256–272, 2018. 8

work page 2018

[3] [3]

Opental: Towards open set temporal action localization

Wentao Bao, Qi Yu, and Yu Kong. Opental: Towards open set temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2979–2989, 2022. 1

work page 2022

[4] [4]

Activitynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015. 6

work page 2015

[5] [5]

Video mamba suite: State space model as a ver- satile alternative for video understanding

Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, and Limin Wang. Video mamba suite: State space model as a ver- satile alternative for video understanding. arXiv preprint arXiv:2403.09626, 2024. 2

work page arXiv 2024

[6] [6]

Cascade evidential learning for open-world weakly-supervised tem- poral action localization

Mengyuan Chen, Junyu Gao, and Changsheng Xu. Cascade evidential learning for open-world weakly-supervised tem- poral action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 14741–14750, 2023. 1

work page 2023

[7] [7]

An algorithm for the machine calculation of complex fourier series

James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297–301, 1965. 5

work page 1965

[8] [8]

Vqgan-clip: Open domain image generation and editing with natural language guidance

Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Ed- ward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. In European Con- ference on Computer Vision, pages 88–105. Springer, 2022. 3

work page 2022

[9] [9]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 18135–18143, 2024. 3

work page 2024

[10] [10]

in the wild

Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding , 155:1– 23, 2017. 6

work page 2017

[11] [11]

Test-time classifier adjustment module for model-agnostic domain generaliza- tion

Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generaliza- tion. Advances in Neural Information Processing Systems , 34:2427–2440, 2021. 3

work page 2021

[12] [12]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR,

work page

[13] [13]

Prompting visual-language models for efficient video understanding

Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In European Conference on Computer Vi- sion, pages 105–124. Springer, 2022. 2, 6, 7

work page 2022

[14] [14]

Distill- ing vision-language pre-training to collaborate with weakly- supervised temporal action localization

Chen Ju, Kunhao Zheng, Jinxiang Liu, Peisen Zhao, Ya Zhang, Jianlong Chang, Qi Tian, and Yanfeng Wang. Distill- ing vision-language pre-training to collaborate with weakly- supervised temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14751–14762, 2023. 3

work page 2023

[15] [15]

Te-tad: Towards full end-to-end temporal action detection via time-aligned coordinate expression

Ho-Joong Kim, Jung-Ho Hong, Heejo Kong, and Seong- Whan Lee. Te-tad: Towards full end-to-end temporal action detection via time-aligned coordinate expression. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18837–18846, 2024. 2

work page 2024

[16] [16]

Self-feedback detr for temporal action detection

Jihwan Kim, Miso Lee, and Jae-Pil Heo. Self-feedback detr for temporal action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 10286–10296, 2023. 2

work page 2023

[17] [17]

Detal: Open-vocabulary temporal action localization with decoupled networks

Zhiheng Li, Yujie Zhong, Ran Song, Tianjiao Li, Lin Ma, and Wei Zhang. Detal: Open-vocabulary temporal action localization with decoupled networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 3, 6, 7

work page 2024

[18] [18]

A comprehensive sur- vey on test-time adaptation under distribution shifts

Jian Liang, Ran He, and Tieniu Tan. A comprehensive sur- vey on test-time adaptation under distribution shifts. Inter- national Journal of Computer Vision, pages 1–34, 2024. 2, 3

work page 2024

[19] [19]

Test-time zero-shot temporal action localization

Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, and Elisa Ricci. Test-time zero-shot temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18720– 18729, 2024. 2, 3, 5, 6, 7

work page 2024

[20] [20]

Single shot tempo- ral action detection

Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot tempo- ral action detection. In Proceedings of the 25th ACM inter- national conference on Multimedia , pages 988–996, 2017. 2

work page 2017

[21] [21]

Video test-time adaptation for action recognition

Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, and Horst Bischof. Video test-time adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 22952–22961, 2023. 3

work page 2023

[22] [22]

End-to-end temporal action detection with 1b parameters across 1000 frames

Shuming Liu, Chen-Lin Zhang, Chen Zhao, and Bernard Ghanem. End-to-end temporal action detection with 1b parameters across 1000 frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18591–18601, 2024. 2

work page 2024

[23] [23]

Depth-aware test-time training for zero-shot video object segmentation

Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi- Man Pun, and Xiaodong Cun. Depth-aware test-time training for zero-shot video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19218–19227, 2024. 3

work page 2024

[24] [24]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 2

work page internal anchor Pith review Pith/arXiv arXiv 2013

[25] [25]

Zero-shot temporal action detection via vision-language prompting

Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, and Tao Xi- ang. Zero-shot temporal action detection via vision-language prompting. In European Conference on Computer Vision , pages 681–697. Springer, 2022. 2, 3, 6, 7

work page 2022

[26] [26]

Clipping: Distilling clip-based models with a student base for video- language retrieval

Renjing Pei, Jianzhuang Liu, Weimian Li, Bin Shao, Song- cen Xu, Peng Dai, Juwei Lu, and Youliang Yan. Clipping: Distilling clip-based models with a student base for video- language retrieval. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 18983–18992, 2023. 3

work page 2023

[27] [27]

Temporal context aggregation network for temporal action proposal refinement

Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, and Nong Sang. Temporal context aggregation network for temporal action proposal refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 485–494, 2021. 2

work page 2021

[28] [28]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International Conference on Machine Learning , pages 8748–8763. PMLR, 2021. 2, 3

work page 2021

[29] [29]

Denseclip: Language-guided dense prediction with context- aware prompting

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 18082–18091, 2022. 3

work page 2022

[30] [30]

Action sensitivity learning for temporal action localization

Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, and Yi Yang. Action sensitivity learning for temporal action localization. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 13457–13469,

work page

[31] [31]

Tridet: Temporal action detection with relative boundary modeling

Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao. Tridet: Temporal action detection with relative boundary modeling. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 18857–18866, 2023. 2

work page 2023

[32] [32]

Autoloc: Weakly-supervised temporal action localization in untrimmed videos

Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision , pages 154– 171, 2018. 4

work page 2018

[33] [33]

Re- laxed transformer decoders for direct action proposal gener- ation

Jing Tan, Jiaqi Tang, Limin Wang, and Gangshan Wu. Re- laxed transformer decoders for direct action proposal gener- ation. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 13526–13535, 2021. 2

work page 2021

[34] [34]

Clip-nerf: Text-and-image driven manip- ulation of neural radiance fields

Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Clip-nerf: Text-and-image driven manip- ulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3835–3844, 2022. 3

work page 2022

[35] [35]

Bilateral adaptation for human-object interac- tion detection with occlusion-robustness

Guangzhi Wang, Yangyang Guo, Ziwei Xu, and Mohan Kankanhalli. Bilateral adaptation for human-object interac- tion detection with occlusion-robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27970–27980, 2024. 3

work page 2024

[36] [36]

Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International conference on machine learn- ing, pages 23318–23340. PMLR, 2022. 3

work page 2022

[37] [37]

Image as a foreign language: Beit pretraining for vision and vision- language tasks

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhil- iang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mo- hammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision- language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175– 19186, 2023. 3

work page 2023

[38] [38]

Two-stream networks for weakly-supervised temporal action localiza- tion with semantic-aware mechanisms

Yu Wang, Yadong Li, and Hongbin Wang. Two-stream networks for weakly-supervised temporal action localiza- tion with semantic-aware mechanisms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18878–18887, 2023. 1

work page 2023

[39] [39]

Vita-clip: Video and text adaptive clip via multimodal prompting

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fa- had Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23034–23044, 2023. 3

work page 2023

[40] [40]

Learning to refactor action and co-occurrence fea- tures for temporal action localization

Kun Xia, Le Wang, Sanping Zhou, Nanning Zheng, and Wei Tang. Learning to refactor action and co-occurrence fea- tures for temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13884–13893, 2022. 2

work page 2022

[41] [41]

Learning in the frequency domain

Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. Learning in the frequency domain. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1740–1749, 2020. 5

work page 2020

[42] [42]

Channel attention for sensor-based activity recognition: embedding features into all frequencies in dct domain

Shige Xu, Lei Zhang, Yin Tang, Chaolei Han, Hao Wu, and Aiguo Song. Channel attention for sensor-based activity recognition: embedding features into all frequencies in dct domain. IEEE Transactions on Knowledge and Data Engi- neering, 35(12):12497–12512, 2023. 5

work page 2023

[43] [43]

Basictad: an astounding rgb-only baseline for tem- poral action detection

Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, and Limin Wang. Basictad: an astounding rgb-only baseline for tem- poral action detection. Computer Vision and Image Under- standing, 232:103692, 2023. 1

work page 2023

[44] [44]

Coca: Contrastive captioners are image-text foundation models

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. 3, 6

work page 2022

[45] [45]

Actionformer: Lo- calizing moments of actions with transformers

Chen-Lin Zhang, Jianxin Wu, and Yin Li. Actionformer: Lo- calizing moments of actions with transformers. In European Conference on Computer Vision , pages 492–510. Springer,

work page

[46] [46]

Hr-pro: Point-supervised temporal action localization via hierarchical reliability prop- agation

Huaxin Zhang, Xiang Wang, Xiaohao Xu, Zhiwu Qing, Changxin Gao, and Nong Sang. Hr-pro: Point-supervised temporal action localization via hierarchical reliability prop- agation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7115–7123, 2024. 6

work page 2024

[47] [47]

Do- mainadaptor: A novel approach to test-time adaptation

Jian Zhang, Lei Qi, Yinghuan Shi, and Yang Gao. Do- mainadaptor: A novel approach to test-time adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18971–18981, 2023. 3

work page 2023

[48] [48]

Vision-language models for vision tasks: A survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence,

work page

[49] [49]

Zstad: Zero-shot temporal activity detection

Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Sen Wang, Zongyuan Ge, and Alexander Hauptmann. Zstad: Zero-shot temporal activity detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 879–888, 2020. 2

work page 2020

[50] [50]

Tn-zstad: Transfer- able network for zero-shot temporal activity detection

Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Zhi- hui Li, Lina Yao, and Alex Hauptmann. Tn-zstad: Transfer- able network for zero-shot temporal activity detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3848–3861, 2022. 2

work page 2022

[51] [51]

Video self- stitching graph network for temporal action localization

Chen Zhao, Ali K Thabet, and Bernard Ghanem. Video self- stitching graph network for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13658–13667, 2021. 2

work page 2021

[52] [52]

Movement enhancement toward multi-scale video feature representa- tion for temporal action detection

Zixuan Zhao, Dongqi Wang, and Xu Zhao. Movement enhancement toward multi-scale video feature representa- tion for temporal action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 13555–13564, 2023. 2

work page 2023

[53] [53]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Com- puter Vision, pages 696–712. Springer, 2022. 3

work page 2022

[54] [54]

Learning from observer gaze: Zero-shot attention prediction oriented by human-object interaction recognition

Yuchen Zhou, Linkai Liu, and Chao Gou. Learning from observer gaze: Zero-shot attention prediction oriented by human-object interaction recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28390–28400, 2024. 3

work page 2024

[55] [55]

Dual detrs for multi-label temporal action de- tection

Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, and Limin Wang. Dual detrs for multi-label temporal action de- tection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 18559– 18569, 2024. 2

work page 2024