pith. sign in

arxiv: 2501.13795 · v1 · pith:OOUWVM2Cnew · submitted 2025-01-23 · 💻 cs.CV

Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

Pith reviewed 2026-05-23 04:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot temporal action detectionvision-language modelstraining-free methodLogOIC scoretest-time adaptationtemporal action localization
0
0 comments X

The pith

Pre-trained vision-language models can directly detect unseen actions in untrimmed videos without any training or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a training-free zero-shot temporal action detection method called FreeZAD that uses existing vision-language models to classify and localize unseen activities in videos. It introduces the LogOIC score and frequency-based calibration to handle proposals without needing explicit temporal modeling or high-quality pseudo-labels. This approach avoids the domain shifts and high computational costs of training-based methods. Experiments show it outperforms state-of-the-art unsupervised methods on THUMOS14 and ActivityNet-1.3 while using only 1/13 of the runtime. Adding test-time adaptation with prototype-centric sampling further closes the gap to fully supervised methods.

Core claim

Existing vision-language models can be leveraged directly for zero-shot temporal action detection without additional fine-tuning by using a logarithmic decay weighted outer-inner contrastive score to evaluate action proposals and frequency-based actionness calibration, leading to better performance than unsupervised methods at a fraction of the runtime.

What carries the argument

The Logarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration that enable direct use of ViL models for classifying and localizing unseen actions.

If this is right

  • The method requires no training, reducing runtime to 1/13 of previous unsupervised approaches.
  • It outperforms state-of-the-art unsupervised zero-shot temporal action detection methods on standard datasets.
  • Equipping it with test-time adaptation using prototype-centric sampling narrows the performance gap with fully supervised methods.
  • The approach mitigates issues from domain shifts and dependence on pseudo-label quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar calibration techniques could be tested on other video understanding tasks that use pre-trained models.
  • This suggests that vision-language models encode sufficient temporal structure for action localization even without explicit training on video data.

Load-bearing premise

Pre-trained vision-language models already contain enough knowledge to classify and localize unseen actions when the right scoring and calibration functions are applied to video proposals.

What would settle it

Running the method on THUMOS14 and finding that its mean average precision is lower than current unsupervised methods would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2501.13795 by Chaolei Han, Hongsong Wang, Jidong Kuang, Jie Gui, Lei Zhang.

Figure 1
Figure 1. Figure 1: We propose the first training-free method for ZS￾TAD, distinguishing it from all existing state-of-the-art methods, which are training-based with varying degrees of supervision. 1. Introduction With the development of social media and surveillance sys￾tems, video understanding has become more important. As a fundamental task in video understanding, temporal action detection (TAD) [38, 43] aims to recognize… view at source ↗
Figure 2
Figure 2. Figure 2: Overall architecture of our proposed training-free zero-shot temporal action detection network. The model recognizes and localizes unseen activities within untrimmed videos with only a single forward pass. Specifically, video-level label is first generated through visual embedding and textual encoding derived from the visual and textual backbones of ViL models. Then, segments are selected by calculating th… view at source ↗
Figure 3
Figure 3. Figure 3: An illustration of Logarithmic Decay Weighted Outer-Inner-Contrastive Score (LogOIC). The green mask and the red mask respectively cover the inner and outer activations within a segment. The adjusted outer activation, shown in orange, are obtained by reweighting the blue activation with a logarithmic decay weight. with its textual feature. Building on this consideration, the cosine similarity is directly c… view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of zero-shot temporal action detection with test-time adaptation (AdaZAD). This paradigm seeks to adapt the pre-trained ViL models for TAD without annotations, and the detected results are obtained after completing all adapta￾tion steps. Prototype-Centric Sampling (PCS) focuses on selecting positive samples surrounding the highest similarity values to the textual features. segment-level calibr… view at source ↗
Figure 5
Figure 5. Figure 5: The left plot shows the average mAP across various [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Error analysis of our FreeZAD. The left is false pos￾itive profile and the right is removing error impact, where G de￾notes the number of ground truths. Only the Top-1G to Top-3G predictions are provided due to the sparsity of the predictions. The enhanced model with a designed test-time adaptation strategy further outperforms the performance and nar￾rows the gap with fully supervised methods. Extensive ab… view at source ↗
read the original abstract

Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes FreeZAD, a training-free zero-shot temporal action detection method that leverages pre-trained vision-language models to classify and localize unseen actions in untrimmed videos. It introduces the Logarithmic decay weighted Outer-Inner-Contrastive (LogOIC) score and frequency-based Actionness Calibration to avoid explicit temporal modeling and pseudo-labels, plus a test-time adaptation (TTA) strategy with Prototype-Centric Sampling (PCS). Experiments on THUMOS14 and ActivityNet-1.3 claim outperformance over state-of-the-art unsupervised ZSTAD methods at 1/13 the runtime, with TTA further narrowing the gap to fully supervised approaches.

Significance. If the empirical claims hold under full verification, the work would be significant for demonstrating that fixed pre-trained ViL models plus lightweight post-processing can deliver competitive ZSTAD performance without any training or adaptation, substantially lowering computational barriers compared to prior unsupervised methods.

major comments (3)
  1. [Experiments] Experiments section: performance claims of outperformance on THUMOS14 and ActivityNet-1.3 together with the 1/13 runtime reduction are stated without accompanying tables of per-class mAP, exact baseline implementations, metric definitions (e.g., IoU thresholds), or measured wall-clock times, preventing independent verification of the central empirical result.
  2. [§3] §3 (Method), LogOIC definition: the claim that the logarithmic decay weighting mitigates domain shift and reliance on pseudo-label quality is not supported by an explicit equation or derivation showing how the outer-inner contrastive term is computed from ViL logits; without this, it is impossible to assess whether the score is parameter-free or reproducible.
  3. [TTA subsection] TTA subsection: the Prototype-Centric Sampling strategy is presented as reliably improving performance, yet no ablation isolating PCS from the base FreeZAD pipeline or analysis of sampling bias on unseen classes is provided, leaving the weakest assumption (direct ViL applicability without fine-tuning) untested.
minor comments (1)
  1. [§3.3] Notation for Actionness Calibration is introduced without a clear link to the frequency-based formula; a short equation would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that enhance verifiability and clarity without altering the core claims.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: performance claims of outperformance on THUMOS14 and ActivityNet-1.3 together with the 1/13 runtime reduction are stated without accompanying tables of per-class mAP, exact baseline implementations, metric definitions (e.g., IoU thresholds), or measured wall-clock times, preventing independent verification of the central empirical result.

    Authors: We agree that the current presentation lacks sufficient detail for independent verification. In the revised manuscript we will add tables reporting per-class mAP, explicitly list the IoU thresholds used (standard 0.1–0.5 for THUMOS14 and 0.5–0.95 for ActivityNet), document the precise baseline implementations and code references, and include measured wall-clock times on the same hardware used for all methods. These additions will directly support the reported outperformance and runtime claims. revision: yes

  2. Referee: [§3] §3 (Method), LogOIC definition: the claim that the logarithmic decay weighting mitigates domain shift and reliance on pseudo-label quality is not supported by an explicit equation or derivation showing how the outer-inner contrastive term is computed from ViL logits; without this, it is impossible to assess whether the score is parameter-free or reproducible.

    Authors: Section 3 already presents LogOIC as a parameter-free score computed from ViL logits, but we acknowledge the derivation could be more explicit. We will insert a dedicated equation block and step-by-step derivation in the revised §3 that shows how the outer-inner contrastive term is obtained from the logits and how the logarithmic decay is applied, thereby confirming reproducibility and the mechanism for reducing domain-shift sensitivity. revision: yes

  3. Referee: [TTA subsection] TTA subsection: the Prototype-Centric Sampling strategy is presented as reliably improving performance, yet no ablation isolating PCS from the base FreeZAD pipeline or analysis of sampling bias on unseen classes is provided, leaving the weakest assumption (direct ViL applicability without fine-tuning) untested.

    Authors: We will add a new ablation subsection that isolates the effect of Prototype-Centric Sampling (PCS) on the base FreeZAD pipeline and includes quantitative analysis of sampling bias across unseen classes. This will provide direct evidence for the contribution of the TTA component. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method (FreeZAD) that applies pre-trained vision-language models to zero-shot temporal action detection via post-processing scores (LogOIC, frequency calibration) and optional TTA (PCS). All performance claims are evaluated against external prior methods on THUMOS14 and ActivityNet-1.3; no derivation chain reduces a claimed result to a fitted parameter or self-citation by construction. The central pipeline is a fixed, non-learned post-processor whose outputs are compared to independent baselines rather than being tautological with its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No details on free parameters, axioms, or invented entities are provided in the abstract; assessment is limited to abstract content only.

pith-pipeline@v0.9.0 · 5756 in / 1005 out tokens · 24776 ms · 2026-05-23T04:48:11.452100+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions

    cs.CV 2026-05 unverdicted novelty 5.0

    Higher temporal resolution in video significantly improves zero-shot semantic understanding of high-speed human actions like kendo.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,

  2. [2]

    Diagnosing error in temporal action detectors

    Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, and Bernard Ghanem. Diagnosing error in temporal action detectors. In Proceedings of the European conference on computer vision (ECCV), pages 256–272, 2018. 8

  3. [3]

    Opental: Towards open set temporal action localization

    Wentao Bao, Qi Yu, and Yu Kong. Opental: Towards open set temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2979–2989, 2022. 1

  4. [4]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015. 6

  5. [5]

    Video mamba suite: State space model as a ver- satile alternative for video understanding

    Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, and Limin Wang. Video mamba suite: State space model as a ver- satile alternative for video understanding. arXiv preprint arXiv:2403.09626, 2024. 2

  6. [6]

    Cascade evidential learning for open-world weakly-supervised tem- poral action localization

    Mengyuan Chen, Junyu Gao, and Changsheng Xu. Cascade evidential learning for open-world weakly-supervised tem- poral action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 14741–14750, 2023. 1

  7. [7]

    An algorithm for the machine calculation of complex fourier series

    James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297–301, 1965. 5

  8. [8]

    Vqgan-clip: Open domain image generation and editing with natural language guidance

    Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Ed- ward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. In European Con- ference on Computer Vision, pages 88–105. Springer, 2022. 3

  9. [9]

    Detecting and preventing hallucinations in large vision language models

    Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 18135–18143, 2024. 3

  10. [10]

    in the wild

    Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding , 155:1– 23, 2017. 6

  11. [11]

    Test-time classifier adjustment module for model-agnostic domain generaliza- tion

    Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generaliza- tion. Advances in Neural Information Processing Systems , 34:2427–2440, 2021. 3

  12. [12]

    Scaling up visual and vision-language representa- tion learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR,

  13. [13]

    Prompting visual-language models for efficient video understanding

    Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In European Conference on Computer Vi- sion, pages 105–124. Springer, 2022. 2, 6, 7

  14. [14]

    Distill- ing vision-language pre-training to collaborate with weakly- supervised temporal action localization

    Chen Ju, Kunhao Zheng, Jinxiang Liu, Peisen Zhao, Ya Zhang, Jianlong Chang, Qi Tian, and Yanfeng Wang. Distill- ing vision-language pre-training to collaborate with weakly- supervised temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14751–14762, 2023. 3

  15. [15]

    Te-tad: Towards full end-to-end temporal action detection via time-aligned coordinate expression

    Ho-Joong Kim, Jung-Ho Hong, Heejo Kong, and Seong- Whan Lee. Te-tad: Towards full end-to-end temporal action detection via time-aligned coordinate expression. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18837–18846, 2024. 2

  16. [16]

    Self-feedback detr for temporal action detection

    Jihwan Kim, Miso Lee, and Jae-Pil Heo. Self-feedback detr for temporal action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 10286–10296, 2023. 2

  17. [17]

    Detal: Open-vocabulary temporal action localization with decoupled networks

    Zhiheng Li, Yujie Zhong, Ran Song, Tianjiao Li, Lin Ma, and Wei Zhang. Detal: Open-vocabulary temporal action localization with decoupled networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 3, 6, 7

  18. [18]

    A comprehensive sur- vey on test-time adaptation under distribution shifts

    Jian Liang, Ran He, and Tieniu Tan. A comprehensive sur- vey on test-time adaptation under distribution shifts. Inter- national Journal of Computer Vision, pages 1–34, 2024. 2, 3

  19. [19]

    Test-time zero-shot temporal action localization

    Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, and Elisa Ricci. Test-time zero-shot temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18720– 18729, 2024. 2, 3, 5, 6, 7

  20. [20]

    Single shot tempo- ral action detection

    Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot tempo- ral action detection. In Proceedings of the 25th ACM inter- national conference on Multimedia , pages 988–996, 2017. 2

  21. [21]

    Video test-time adaptation for action recognition

    Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, and Horst Bischof. Video test-time adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 22952–22961, 2023. 3

  22. [22]

    End-to-end temporal action detection with 1b parameters across 1000 frames

    Shuming Liu, Chen-Lin Zhang, Chen Zhao, and Bernard Ghanem. End-to-end temporal action detection with 1b parameters across 1000 frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18591–18601, 2024. 2

  23. [23]

    Depth-aware test-time training for zero-shot video object segmentation

    Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi- Man Pun, and Xiaodong Cun. Depth-aware test-time training for zero-shot video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19218–19227, 2024. 3

  24. [24]

    Efficient Estimation of Word Representations in Vector Space

    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 2

  25. [25]

    Zero-shot temporal action detection via vision-language prompting

    Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, and Tao Xi- ang. Zero-shot temporal action detection via vision-language prompting. In European Conference on Computer Vision , pages 681–697. Springer, 2022. 2, 3, 6, 7

  26. [26]

    Clipping: Distilling clip-based models with a student base for video- language retrieval

    Renjing Pei, Jianzhuang Liu, Weimian Li, Bin Shao, Song- cen Xu, Peng Dai, Juwei Lu, and Youliang Yan. Clipping: Distilling clip-based models with a student base for video- language retrieval. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 18983–18992, 2023. 3

  27. [27]

    Temporal context aggregation network for temporal action proposal refinement

    Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, and Nong Sang. Temporal context aggregation network for temporal action proposal refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 485–494, 2021. 2

  28. [28]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International Conference on Machine Learning , pages 8748–8763. PMLR, 2021. 2, 3

  29. [29]

    Denseclip: Language-guided dense prediction with context- aware prompting

    Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 18082–18091, 2022. 3

  30. [30]

    Action sensitivity learning for temporal action localization

    Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, and Yi Yang. Action sensitivity learning for temporal action localization. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 13457–13469,

  31. [31]

    Tridet: Temporal action detection with relative boundary modeling

    Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao. Tridet: Temporal action detection with relative boundary modeling. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 18857–18866, 2023. 2

  32. [32]

    Autoloc: Weakly-supervised temporal action localization in untrimmed videos

    Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision , pages 154– 171, 2018. 4

  33. [33]

    Re- laxed transformer decoders for direct action proposal gener- ation

    Jing Tan, Jiaqi Tang, Limin Wang, and Gangshan Wu. Re- laxed transformer decoders for direct action proposal gener- ation. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 13526–13535, 2021. 2

  34. [34]

    Clip-nerf: Text-and-image driven manip- ulation of neural radiance fields

    Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Clip-nerf: Text-and-image driven manip- ulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3835–3844, 2022. 3

  35. [35]

    Bilateral adaptation for human-object interac- tion detection with occlusion-robustness

    Guangzhi Wang, Yangyang Guo, Ziwei Xu, and Mohan Kankanhalli. Bilateral adaptation for human-object interac- tion detection with occlusion-robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27970–27980, 2024. 3

  36. [36]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International conference on machine learn- ing, pages 23318–23340. PMLR, 2022. 3

  37. [37]

    Image as a foreign language: Beit pretraining for vision and vision- language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhil- iang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mo- hammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision- language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175– 19186, 2023. 3

  38. [38]

    Two-stream networks for weakly-supervised temporal action localiza- tion with semantic-aware mechanisms

    Yu Wang, Yadong Li, and Hongbin Wang. Two-stream networks for weakly-supervised temporal action localiza- tion with semantic-aware mechanisms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18878–18887, 2023. 1

  39. [39]

    Vita-clip: Video and text adaptive clip via multimodal prompting

    Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fa- had Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23034–23044, 2023. 3

  40. [40]

    Learning to refactor action and co-occurrence fea- tures for temporal action localization

    Kun Xia, Le Wang, Sanping Zhou, Nanning Zheng, and Wei Tang. Learning to refactor action and co-occurrence fea- tures for temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13884–13893, 2022. 2

  41. [41]

    Learning in the frequency domain

    Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. Learning in the frequency domain. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1740–1749, 2020. 5

  42. [42]

    Channel attention for sensor-based activity recognition: embedding features into all frequencies in dct domain

    Shige Xu, Lei Zhang, Yin Tang, Chaolei Han, Hao Wu, and Aiguo Song. Channel attention for sensor-based activity recognition: embedding features into all frequencies in dct domain. IEEE Transactions on Knowledge and Data Engi- neering, 35(12):12497–12512, 2023. 5

  43. [43]

    Basictad: an astounding rgb-only baseline for tem- poral action detection

    Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, and Limin Wang. Basictad: an astounding rgb-only baseline for tem- poral action detection. Computer Vision and Image Under- standing, 232:103692, 2023. 1

  44. [44]

    Coca: Contrastive captioners are image-text foundation models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. 3, 6

  45. [45]

    Actionformer: Lo- calizing moments of actions with transformers

    Chen-Lin Zhang, Jianxin Wu, and Yin Li. Actionformer: Lo- calizing moments of actions with transformers. In European Conference on Computer Vision , pages 492–510. Springer,

  46. [46]

    Hr-pro: Point-supervised temporal action localization via hierarchical reliability prop- agation

    Huaxin Zhang, Xiang Wang, Xiaohao Xu, Zhiwu Qing, Changxin Gao, and Nong Sang. Hr-pro: Point-supervised temporal action localization via hierarchical reliability prop- agation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7115–7123, 2024. 6

  47. [47]

    Do- mainadaptor: A novel approach to test-time adaptation

    Jian Zhang, Lei Qi, Yinghuan Shi, and Yang Gao. Do- mainadaptor: A novel approach to test-time adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18971–18981, 2023. 3

  48. [48]

    Vision-language models for vision tasks: A survey

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence,

  49. [49]

    Zstad: Zero-shot temporal activity detection

    Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Sen Wang, Zongyuan Ge, and Alexander Hauptmann. Zstad: Zero-shot temporal activity detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 879–888, 2020. 2

  50. [50]

    Tn-zstad: Transfer- able network for zero-shot temporal activity detection

    Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Zhi- hui Li, Lina Yao, and Alex Hauptmann. Tn-zstad: Transfer- able network for zero-shot temporal activity detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3848–3861, 2022. 2

  51. [51]

    Video self- stitching graph network for temporal action localization

    Chen Zhao, Ali K Thabet, and Bernard Ghanem. Video self- stitching graph network for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13658–13667, 2021. 2

  52. [52]

    Movement enhancement toward multi-scale video feature representa- tion for temporal action detection

    Zixuan Zhao, Dongqi Wang, and Xu Zhao. Movement enhancement toward multi-scale video feature representa- tion for temporal action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 13555–13564, 2023. 2

  53. [53]

    Extract free dense labels from clip

    Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Com- puter Vision, pages 696–712. Springer, 2022. 3

  54. [54]

    Learning from observer gaze: Zero-shot attention prediction oriented by human-object interaction recognition

    Yuchen Zhou, Linkai Liu, and Chao Gou. Learning from observer gaze: Zero-shot attention prediction oriented by human-object interaction recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28390–28400, 2024. 3

  55. [55]

    Dual detrs for multi-label temporal action de- tection

    Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, and Limin Wang. Dual detrs for multi-label temporal action de- tection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 18559– 18569, 2024. 2