Zero-Shot Temporal Action Localization Through Textual Guidance
Pith reviewed 2026-05-22 07:50 UTC · model grok-4.3
The pith
Rich textual information from language models enables training-free localization of unseen actions in videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose TEGU, a novel approach for zero-shot temporal action localization that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context improves fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets, where it outperforms state-of-the-art ZS-TAL approaches that do not involve training.
What carries the argument
Textual guidance that combines large language model outputs with structured text from captions to supply richer cues for distinguishing actions.
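To make the idea concrete, here is a minimal sketch of how text-derived class cues could score video frames and yield candidate segments. It is not the paper's implementation. It assumes frame features and text embeddings already live in a shared VLM embedding space, and every function name, the averaging step, and the fixed threshold are hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def class_prototype(text_embeddings):
    """Average several text embeddings for one class into a single vector.

    Assumption: the inputs stand in for embeddings of a class name,
    an LLM-written description, and caption-derived text.
    """
    n = len(text_embeddings)
    dim = len(text_embeddings[0])
    return [sum(e[i] for e in text_embeddings) / n for i in range(dim)]

def frame_scores(frame_features, prototype):
    """Score every frame against the class prototype."""
    return [cosine(f, prototype) for f in frame_features]

def scores_to_segments(scores, threshold):
    """Group consecutive above-threshold frames into (start, end) pairs, end exclusive."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments

# Toy 2-D example: frames 1 and 2 resemble the class text, frames 0 and 3 do not.
frames = [[1.0, 0.0], [0.1, 1.0], [0.0, 1.0], [1.0, 0.1]]
proto = class_prototype([[0.0, 1.0], [0.2, 0.8]])
segments = scores_to_segments(frame_scores(frames, proto), threshold=0.5)
print(segments)
```

In this toy setup, richer text only changes the prototype vector, so any localization gain would have to come through sharper per-frame scores. That is the dependency the referee's first major comment asks the authors to spell out.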
If this is right
- Actions never encountered in training can still be localized in untrimmed videos.
- Large annotated video datasets are no longer required for competitive zero-shot performance.
- Higher accuracy is achieved than other training-free ZS-TAL methods on THUMOS14 and ActivityNet-v1.3.
- Linguistic context helps separate fine-grained actions that vision models alone struggle to tell apart.
Where Pith is reading between the lines
- The same textual enrichment could be tested on zero-shot video retrieval or segmentation tasks.
- Systems for domains with changing action sets, such as monitoring or sports, could update without new video labels.
- Pairing the guidance with light visual adaptation on a few examples might further raise accuracy in practice.
Load-bearing premise
Rich textual information from large language models and captions can provide sufficient cues to discriminate fine-grained actions without any training on video examples.
What would settle it
A direct comparison on videos of closely related actions where removing the textual components leaves performance unchanged or improved would show the added text does not deliver the claimed discrimination benefit.
Original abstract
Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TEGU, a training-free zero-shot temporal action localization (ZS-TAL) method that compensates for absent labeled video data by leveraging rich textual information from large language models and structured text extracted from video captions. This linguistic context is intended to supply finer cues for distinguishing unseen actions and improving both classification and temporal boundary localization. Experiments on THUMOS14 and ActivityNet-v1.3 are reported to show outperformance over prior training-free ZS-TAL baselines that rely on VLMs.
Significance. If the central claim holds, the work would be significant for demonstrating that pre-trained LLMs and caption-derived text can substitute for supervised training in ZS-TAL, offering a path toward more generalizable, annotation-free video localization that builds directly on frozen VLMs without task-specific fine-tuning.
major comments (2)
- [Method] Method section: the description of how LLM-generated descriptions and structured caption text achieve temporal alignment or boundary refinement is insufficient. Global or class-level textual cues are not segment-aligned by default, so it is unclear what mechanism (e.g., prompt engineering, attention modulation, or auxiliary alignment) allows them to refine frame-level or boundary decisions inside the VLM without additional temporal modeling; this directly bears on whether textual guidance genuinely compensates for missing supervision in localization.
- [Experiments] Experiments section (results tables): the reported gains over baselines should be accompanied by ablations that isolate the contribution of textual guidance to localization metrics (e.g., boundary precision or mAP at high IoU thresholds) versus classification accuracy alone. Without such breakdown it remains possible that improvements stem primarily from richer class prompts rather than localization-specific benefits.
minor comments (2)
- [Abstract] Abstract: the phrase 'structured text extracted from captions' is used without defining the extraction process or the structure imposed; a brief clarification would aid readability.
- [Method] Notation: the integration of textual embeddings with VLM visual features could be denoted more explicitly (e.g., via an equation showing the fusion step) to avoid ambiguity for readers.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our paper. We address each of the major comments in detail below, providing clarifications and indicating where revisions will be made to improve the manuscript.
- Referee: [Method] Method section: the description of how LLM-generated descriptions and structured caption text achieve temporal alignment or boundary refinement is insufficient. Global or class-level textual cues are not segment-aligned by default, so it is unclear what mechanism (e.g., prompt engineering, attention modulation, or auxiliary alignment) allows them to refine frame-level or boundary decisions inside the VLM without additional temporal modeling; this directly bears on whether textual guidance genuinely compensates for missing supervision in localization.
  Authors: We appreciate this observation. Our approach leverages LLM-generated descriptions to create more discriminative class representations and uses structured text from video captions to provide contextual cues that aid in distinguishing action boundaries. The integration occurs through enhanced prompting of the VLM, where the textual information is concatenated with segment features to improve zero-shot classification per potential action segment. This indirectly refines localization by better identifying action presence, which is then used for boundary detection via post-processing. We acknowledge that the current description in the method section could be more explicit about this process. In the revised manuscript, we will add a dedicated subsection detailing the prompt construction and how it influences the VLM's temporal decisions without requiring additional training or modeling.
  Revision: yes
- Referee: [Experiments] Experiments section (results tables): the reported gains over baselines should be accompanied by ablations that isolate the contribution of textual guidance to localization metrics (e.g., boundary precision or mAP at high IoU thresholds) versus classification accuracy alone. Without such breakdown it remains possible that improvements stem primarily from richer class prompts rather than localization-specific benefits.
  Authors: We agree that such ablations would strengthen the claims. Our current results demonstrate improvements in standard ZS-TAL metrics like mAP at various IoU thresholds on THUMOS14 and ActivityNet-v1.3, which inherently require accurate localization. To isolate the effects, we will include new ablation experiments in the revised version that compare performance using standard class names versus the rich textual guidance, reporting both overall mAP and specifically at high IoU (e.g., 0.5 and 0.7) to highlight localization improvements. This will help demonstrate that the textual information contributes to better boundary refinement beyond just classification.
  Revision: yes
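For readers unfamiliar with the metric under discussion, the following sketch shows the temporal IoU test that decides whether a predicted segment counts as a match at a given threshold. It covers only the matching criterion, not a full mAP computation, and the function names and example intervals are illustrative assumptions rather than anything taken from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def hits_at_thresholds(pred, gt, thresholds=(0.3, 0.5, 0.7)):
    """Report, per IoU threshold, whether the prediction would count as a match."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}

# Same ground-truth action, two predictions with different boundary quality.
ground_truth = (12.0, 20.0)
loose = hits_at_thresholds((10.0, 22.0), ground_truth)  # boundaries off by 2 s each
tight = hits_at_thresholds((12.5, 20.0), ground_truth)  # start off by 0.5 s
print(loose)
print(tight)
```

The loose prediction passes at 0.3 and 0.5 but fails at 0.7, while the tight one passes at all three. This is why results at high IoU thresholds are the natural place to look for localization-specific gains, as opposed to gains from better classification alone.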
Circularity Check
No circularity: method relies on external pre-trained VLMs/LLMs and caption text rather than self-referential fits or derivations
The paper introduces TEGU as a training-free ZS-TAL approach that augments frozen VLMs with rich textual cues from LLMs and structured captions to improve fine-grained discrimination and localization. No equations, parameters, or predictions are shown to reduce by construction to the paper's own inputs or prior self-citations. All core components (VLM backbone, LLM-generated descriptions, caption extraction) are external and independently pre-trained, with performance evaluated on standard benchmarks (THUMOS14, ActivityNet-v1.3) against other non-training baselines. This satisfies the self-contained criterion with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem.
we propose a novel approach, “Textual Guidance for finer localization of actions in videos” (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions... scene triplets... affine triplets and distractor triplets... max-margin ranking loss
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.