Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models
Pith reviewed 2026-05-23 04:48 UTC · model grok-4.3
The pith
Pre-trained vision-language models can directly detect unseen actions in untrimmed videos without any training or fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing vision-language models can be used directly for zero-shot temporal action detection, with no additional fine-tuning. Two components make this possible: a logarithmic decay weighted outer-inner contrastive score that evaluates action proposals, and a frequency-based actionness calibration. The paper reports that the resulting method outperforms unsupervised methods at a fraction of the runtime.
What carries the argument
The Logarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration that enable direct use of ViL models for classifying and localizing unseen actions.
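The page does not reproduce the LogOIC formula, so the following is only a rough sketch of the general idea, not the paper's definition. It assumes an AutoLoc-style outer-inner contrast (mean actionness inside a proposal minus mean actionness in the margins just outside it) and a hypothetical weight of the form 1 / log(e + d) that discounts margin frames farther from the boundary. The function names, the `inflation` ratio, and the weight form are all assumptions made for illustration.

```python
import math
from typing import Sequence


def log_decay_weight(distance: int) -> float:
    """Hypothetical weight that decays logarithmically with distance in frames."""
    return 1.0 / math.log(math.e + distance)


def log_oic_score(actionness: Sequence[float], start: int, end: int,
                  inflation: float = 0.25) -> float:
    """Inner mean minus log-decay-weighted outer mean for proposal [start, end).

    This is an illustrative stand-in, not the scoring function from the paper.
    """
    if not 0 <= start < end <= len(actionness):
        raise ValueError("proposal must satisfy 0 <= start < end <= len(actionness)")

    inner = actionness[start:end]
    inner_mean = sum(inner) / len(inner)

    # Margin width on each side, proportional to the proposal length (assumed).
    margin = max(1, round(inflation * (end - start)))
    weighted_sum = 0.0
    weight_total = 0.0

    # Left margin: distance 0 is the frame adjacent to the start boundary.
    for t in range(max(0, start - margin), start):
        w = log_decay_weight(start - 1 - t)
        weighted_sum += w * actionness[t]
        weight_total += w

    # Right margin: distance 0 is the frame adjacent to the end boundary.
    for t in range(end, min(len(actionness), end + margin)):
        w = log_decay_weight(t - end)
        weighted_sum += w * actionness[t]
        weight_total += w

    # A proposal spanning the whole video has no margin; treat the outer mean as 0.
    outer_mean = weighted_sum / weight_total if weight_total > 0 else 0.0
    return inner_mean - outer_mean
```

Under these assumptions, a proposal that tightly covers a high-actionness region scores higher than a looser one that also swallows background frames, which is the behavior an outer-inner contrast is meant to reward.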
If this is right
- The method requires no training, reducing runtime to 1/13 of previous unsupervised approaches.
- It outperforms state-of-the-art unsupervised zero-shot temporal action detection methods on standard datasets.
- Equipping it with test-time adaptation using prototype-centric sampling narrows the performance gap with fully supervised methods.
- The approach mitigates issues from domain shifts and dependence on pseudo-label quality.
Where Pith is reading between the lines
- Similar calibration techniques could be tested on other video understanding tasks that use pre-trained models.
- This suggests that vision-language models encode sufficient temporal structure for action localization even without explicit training on video data.
Load-bearing premise
Pre-trained vision-language models already contain enough knowledge to classify and localize unseen actions when the right scoring and calibration functions are applied to video proposals.
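For context, the premise builds on the usual way contrastive vision-language models classify without training: a visual embedding is compared with text embeddings of the class prompts by cosine similarity, and the scaled similarities are passed through a softmax. The sketch below uses toy vectors in place of real encoder outputs; the function names, the embeddings, and the `scale` value are illustrative assumptions rather than details taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_probs(visual: Sequence[float],
                    text_embeddings: Sequence[Sequence[float]],
                    scale: float = 100.0) -> List[float]:
    """Softmax over scaled cosine similarities to each class text embedding."""
    logits = [scale * cosine(visual, t) for t in text_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: two classes, one frame embedding that lies close to class 0.
probs = zero_shot_probs([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```

Applying this per frame yields class scores over time; whether those scores are good enough to localize actions is exactly what the premise asserts.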
What would settle it
Running the method on THUMOS14 and finding that its mean average precision is lower than current unsupervised methods would falsify the performance claim.
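The comparison above rests on mean average precision computed at temporal IoU thresholds. As a reminder of what is being measured, here is a minimal sketch of temporal IoU and a simplified, uninterpolated average precision for a single class at one threshold. It is not the official evaluation code of either benchmark: the names are hypothetical, and the matching rule (each prediction greedily takes the best still-unmatched ground truth) is a simplification.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: List[Tuple[float, float, float]],
                      gts: List[Segment], threshold: float) -> float:
    """Simplified AP for one class; preds are (start, end, confidence)."""
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(preds, key=lambda p: -p[2])  # highest confidence first
    for rank, (start, end, _) in enumerate(ranked, start=1):
        best_iou, best_idx = 0.0, -1
        for i, gt in enumerate(gts):
            if matched[i]:
                continue  # each ground truth can be claimed only once
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx >= 0 and best_iou >= threshold:
            matched[best_idx] = True
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / len(gts)
```

Averaging this quantity over classes gives mAP at one threshold, and averaging again over a set of thresholds gives the single figure usually quoted when methods are compared.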
Original abstract
Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FreeZAD, a training-free zero-shot temporal action detection method that leverages pre-trained vision-language models to classify and localize unseen actions in untrimmed videos. It introduces the Logarithmic decay weighted Outer-Inner-Contrastive (LogOIC) score and frequency-based Actionness Calibration to avoid explicit temporal modeling and pseudo-labels, plus a test-time adaptation (TTA) strategy with Prototype-Centric Sampling (PCS). Experiments on THUMOS14 and ActivityNet-1.3 claim outperformance over state-of-the-art unsupervised ZSTAD methods at 1/13 the runtime, with TTA further narrowing the gap to fully supervised approaches.
Significance. If the empirical claims hold under full verification, the work would be significant for demonstrating that fixed pre-trained ViL models plus lightweight post-processing can deliver competitive ZSTAD performance without any training or adaptation, substantially lowering computational barriers compared to prior unsupervised methods.
major comments (3)
- [Experiments] Experiments section: performance claims of outperformance on THUMOS14 and ActivityNet-1.3 together with the 1/13 runtime reduction are stated without accompanying tables of per-class mAP, exact baseline implementations, metric definitions (e.g., IoU thresholds), or measured wall-clock times, preventing independent verification of the central empirical result.
- [§3] §3 (Method), LogOIC definition: the claim that the logarithmic decay weighting mitigates domain shift and reliance on pseudo-label quality is not supported by an explicit equation or derivation showing how the outer-inner contrastive term is computed from ViL logits; without this, it is impossible to assess whether the score is parameter-free or reproducible.
- [TTA subsection] TTA subsection: the Prototype-Centric Sampling strategy is presented as reliably improving performance, yet no ablation isolating PCS from the base FreeZAD pipeline or analysis of sampling bias on unseen classes is provided, leaving the weakest assumption (direct ViL applicability without fine-tuning) untested.
minor comments (1)
- [§3.3] Notation for Actionness Calibration is introduced without a clear link to the frequency-based formula; a short equation would improve clarity.
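Since the page gives no formula for the calibration either, the sketch below shows only one plausible shape a frequency-based step could take, and it should not be read as the paper's method. It treats per-frame actionness as a 1-D signal, moves it to the frequency domain, zeroes the high-frequency components, and transforms back, which suppresses frame-level jitter. The low-pass choice, the `keep` parameter, and all names are assumptions. A naive O(n²) DFT keeps the example dependency-free; a real pipeline would use an FFT.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[complex]:
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def low_pass_calibrate(actionness: Sequence[float], keep: int) -> List[float]:
    """Keep the DC term and the `keep` lowest frequencies; zero the rest."""
    n = len(actionness)
    spectrum = dft(actionness)
    for k in range(n):
        # Bins k and n - k form a conjugate pair for a real-valued signal,
        # so both must be kept or dropped together.
        frequency = min(k, n - k)
        if frequency > keep:
            spectrum[k] = 0j
    return [value.real for value in idft(spectrum)]
```

With `keep=0` only the mean survives, and with `keep` at or above half the signal length the input is returned unchanged, so the parameter trades smoothing against temporal detail.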
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that enhance verifiability and clarity without altering the core claims.
- Referee: [Experiments] Experiments section: performance claims of outperformance on THUMOS14 and ActivityNet-1.3 together with the 1/13 runtime reduction are stated without accompanying tables of per-class mAP, exact baseline implementations, metric definitions (e.g., IoU thresholds), or measured wall-clock times, preventing independent verification of the central empirical result.
Authors: We agree that the current presentation lacks sufficient detail for independent verification. In the revised manuscript we will add tables reporting per-class mAP, explicitly list the IoU thresholds used (standard 0.1–0.5 for THUMOS14 and 0.5–0.95 for ActivityNet), document the precise baseline implementations and code references, and include measured wall-clock times on the same hardware used for all methods. These additions will directly support the reported outperformance and runtime claims. revision: yes
- Referee: [§3] §3 (Method), LogOIC definition: the claim that the logarithmic decay weighting mitigates domain shift and reliance on pseudo-label quality is not supported by an explicit equation or derivation showing how the outer-inner contrastive term is computed from ViL logits; without this, it is impossible to assess whether the score is parameter-free or reproducible.
Authors: Section 3 already presents LogOIC as a parameter-free score computed from ViL logits, but we acknowledge the derivation could be more explicit. We will insert a dedicated equation block and step-by-step derivation in the revised §3 that shows how the outer-inner contrastive term is obtained from the logits and how the logarithmic decay is applied, thereby confirming reproducibility and the mechanism for reducing domain-shift sensitivity. revision: yes
- Referee: [TTA subsection] TTA subsection: the Prototype-Centric Sampling strategy is presented as reliably improving performance, yet no ablation isolating PCS from the base FreeZAD pipeline or analysis of sampling bias on unseen classes is provided, leaving the weakest assumption (direct ViL applicability without fine-tuning) untested.
Authors: We will add a new ablation subsection that isolates the effect of Prototype-Centric Sampling (PCS) on the base FreeZAD pipeline and includes quantitative analysis of sampling bias across unseen classes. This will provide direct evidence for the contribution of the TTA component. revision: yes
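The page names Prototype-Centric Sampling without describing it. One possible reading, offered purely as an assumption, is that test-time adaptation uses only the proposals whose features lie closest to a per-class prototype. The sketch below builds each prototype as the mean feature of the proposals predicted for that class and keeps the `top_k` nearest by cosine similarity. Every name and design choice here is hypothetical and may differ from the paper's PCS.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prototype_centric_sample(features: Sequence[Sequence[float]],
                             labels: Sequence[int],
                             top_k: int) -> Dict[int, List[int]]:
    """For each predicted class, return indices of the top_k features nearest its mean."""
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    selected: Dict[int, List[int]] = {}
    for label, indices in by_class.items():
        dim = len(features[indices[0]])
        # Prototype = mean feature of all proposals assigned to this class (assumed).
        prototype = [sum(features[i][d] for i in indices) / len(indices)
                     for d in range(dim)]
        ranked = sorted(indices,
                        key=lambda i: cosine(features[i], prototype),
                        reverse=True)
        selected[label] = ranked[:top_k]
    return selected
```

An ablation of the kind the referee requests could compare adaptation on these selected indices against adaptation on all proposals or on a random subset of the same size.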
Circularity Check
No significant circularity
The paper presents an empirical method (FreeZAD) that applies pre-trained vision-language models to zero-shot temporal action detection via post-processing scores (LogOIC, frequency calibration) and optional TTA (PCS). All performance claims are evaluated against external prior methods on THUMOS14 and ActivityNet-1.3; no derivation chain reduces a claimed result to a fitted parameter or self-citation by construction. The central pipeline is a fixed, non-learned post-processor whose outputs are compared to independent baselines rather than being tautological with its inputs.
Forward citations
Cited by 1 Pith paper
- High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions
  Higher temporal resolution in video significantly improves zero-shot semantic understanding of high-speed human actions like kendo.