Zero-Shot Temporal Action Localization Through Textual Guidance

Alessandro Conti; Benedetta Liberatori; Elisa Ricci; Lorenzo Vaquero; Paolo Rota; Yiming Wang

arxiv: 2605.22201 · v1 · pith:K7XGSAC3new · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 07:50 UTC · model grok-4.3

keywords zero-shot temporal action localization · textual guidance · vision-language models · untrimmed videos · fine-grained action discrimination · training-free localization · THUMOS14 · ActivityNet

The pith

Rich textual information from language models enables training-free localization of unseen actions in videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that detailed textual cues can replace training data for identifying and timing actions never seen before in zero-shot temporal action localization. It incorporates descriptions generated by large language models together with structured details pulled from captions to supply extra context that helps separate similar actions. Existing approaches either depend on vision-language models that miss fine distinctions or require large labeled video collections that restrict generalization. If this holds, video systems could adapt to novel actions using only language resources instead of new annotated footage.

Core claim

We propose TEGU, a novel approach for zero-shot temporal action localization that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context improves fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets, where it outperforms state-of-the-art ZS-TAL approaches that do not involve training.

What carries the argument

Textual guidance that combines large language model outputs with structured text from captions to supply richer cues for distinguishing actions.
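As a rough illustration of the idea above, the sketch below scores a video feature against class prototypes by cosine similarity, once with class-name embeddings alone and once with prototypes enriched by description embeddings. The vectors are made-up toy numbers, the helper names (`cosine`, `mean_vector`, `classify`) are hypothetical, and averaging name and description embeddings is an assumption made for illustration, not necessarily how TEGU fuses its textual cues.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(video_feature, class_prototypes):
    """Return the best-scoring class and all cosine scores."""
    scores = {
        name: cosine(video_feature, prototype)
        for name, prototype in class_prototypes.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for text embeddings (assumption: real ones would come
# from the text encoder of a vision-language model).
name_emb = {"GolfSwing": [1.0, 0.0, 0.0], "TennisSwing": [0.9, 0.1, 0.0]}
desc_emb = {"GolfSwing": [[0.8, 0.0, 0.6]], "TennisSwing": [[0.8, 0.6, 0.0]]}

# Enriched prototype: mean of the class-name and description embeddings
# (an illustrative fusion rule, not confirmed by the source).
prototypes = {c: mean_vector([name_emb[c]] + desc_emb[c]) for c in name_emb}

# Toy stand-in for an averaged video feature.
video_feature = [0.7, 0.1, 0.7]

baseline, _ = classify(video_feature, name_emb)
predicted, scores = classify(video_feature, prototypes)
print(baseline, predicted)
```

With these toy numbers the name-only baseline picks TennisSwing while the enriched prototypes pick GolfSwing, which is the kind of fine-grained separation the added text is meant to provide.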

If this is right

Actions never encountered in training can still be localized in untrimmed videos.
Large annotated video datasets are no longer required for competitive zero-shot performance.
Higher accuracy is achieved than other training-free ZS-TAL methods on THUMOS14 and ActivityNet-v1.3.
Linguistic context helps separate fine-grained actions that vision models alone struggle to tell apart.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same textual enrichment could be tested on zero-shot video retrieval or segmentation tasks.
Systems for domains with changing action sets, such as monitoring or sports, could update without new video labels.
Pairing the guidance with light visual adaptation on a few examples might further raise accuracy in practice.

Load-bearing premise

Rich textual information from large language models and captions can provide sufficient cues to discriminate fine-grained actions without any training on video examples.

What would settle it

A direct comparison on videos of closely related actions where removing the textual components leaves performance unchanged or improved would show the added text does not deliver the claimed discrimination benefit.

Figures

Figures reproduced from arXiv: 2605.22201 by Alessandro Conti, Benedetta Liberatori, Elisa Ricci, Lorenzo Vaquero, Paolo Rota, Yiming Wang.

**Figure 1.** Textual guidance for action localization. We propose to use automatically extracted textual cues as an alternative to supervisory signal to adapt a pre-trained vision and language model to the task of temporal action localization. After generating class-level and video-level textual cues, we demonstrate the effectiveness of exploiting their combination to tune a model at test time.

**Figure 2.** TEGU steps 1-2. Given an input video, TEGU first encodes the class names and its frames into class features and its average video features to get the class predictions at video level. Then, it generates the predicted class’ descriptions and objects. Concurrently, it generates frame captions and parses scene graphs, which are clustered to remove redundancy and get video triplets.

**Figure 3.** (no caption recovered)

**Figure 4.** Impact of varying the number of S scene triplets. Results are collected on THUMOS14 (50%-50%) and on THUMOS14 (75%-25%).

Table III (partial, spilled into the Figure 4 caption): Ablation on the use of descriptions D and triplets T. Results are collected on THUMOS14 (50%-50%) and on THUMOS14 (75%-25%). TEGU final configuration is highlighted. mAP (%) ↑ at 0.3 / 0.4 / 0.5 / 0.6 / 0.7 / Avg. for the 50%-50% setting: 16.9 / 10.7 / 6.1 / 3.2 / 1.6 / 7.7 (no component marked); 17.3 / 11.7 / 6.9 / 3.6 / 1.8 / 8.3 (one component marked ✓) …

**Figure 5.** Image-to-text similarities. Cosine similarities between frames and ground truth video classes, computed on VLM embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class.

**Figure 6.** Text-to-text similarities. Cosine similarities between frame-level captions and ground truth video classes, computed on SentenceBERT embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class.

**Figure 7.** Text-to-text similarities in non-foreground frames. Cosine similarities between ground truth video classes and (i) frame-level captions grouped by background and transition and (ii) scene triplets, again grouped by background and transition. Values are averaged per-class on ActivityNet-v1.3.

**Figure 8.** Text-to-text similarities in non-foreground frames. Cosine similarities between ground truth video classes and (i) frame-level captions grouped by background and transition and (ii) scene triplets, again grouped by background and transition. Values are averaged per-class.

**Figure 9.** Text-to-text similarities on THUMOS14. Cosine similarities between frame-level captions and ground truth video classes, computed on SentenceBERT embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class. We report the average using different captioning models, together with standard deviations.

**Figure 10.** Image-to-text similarities on ActivityNet-v1.3. Cosine similarities between frames and ground truth video classes, computed on VLM embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class.

**Figure 11.** Text-to-text similarities on ActivityNet-v1.3. Cosine similarities between frame-level captions and ground truth video classes, computed on SentenceBERT embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class.

**Figure 12.** Text-to-text similarities in non-foreground frames. Cosine similarities between ground truth video classes and (i) frame-level captions grouped by background and transition and (ii) scene triplets, again grouped by background and transition. Values are averaged per-class.
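The similarity analyses in the captions above average cosine similarities separately over foreground, background, and transition frames. A minimal sketch of that bookkeeping follows, assuming each frame already has a caption embedding and a frame-type label. The function name `mean_similarity_by_group` and the two-dimensional toy vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity_by_group(caption_embeddings, frame_labels, class_embedding):
    """Average caption-to-class cosine similarity per frame type."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for embedding, label in zip(caption_embeddings, frame_labels):
        sums[label] += cosine(embedding, class_embedding)
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Toy caption embeddings, one per frame (assumption: real ones would be
# sentence embeddings of the generated frame captions).
captions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
labels = ["foreground", "background", "transition", "foreground"]
class_emb = [1.0, 0.0]  # toy embedding of the ground-truth class name

stats = mean_similarity_by_group(captions, labels, class_emb)
for label, value in sorted(stats.items()):
    print(f"{label}: {value:.3f}")
```

A clear gap between the foreground average and the other two groups would suggest that captions carry a usable signal for separating action from non-action frames.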

Original abstract

Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, “Textual Guidance for finer localization of actions in videos” (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TEGU adds LLM text and caption structure to a frozen VLM for training-free ZS-TAL and reports gains over other no-training baselines, but the localization-specific benefit is not yet clearly separated from better class prompting.

The core claim is that richer textual cues from LLMs and structured captions can stand in for missing training data in zero-shot temporal action localization. The authors avoid any fine-tuning and instead feed this extra language context into an existing VLM to improve fine-grained action discrimination on THUMOS14 and ActivityNet-v1.3. That direction is sensible given how hard it is to get labeled video data for every new action class. The no-training constraint is also a real practical advantage over methods that still require large annotated video corpora. Credit to them for focusing on that setting and for trying to make the textual side do more work than simple class-name prompts.

The experiments compare against other training-free ZS-TAL baselines, which is the right reference point. If the gains hold in the full tables, that is useful evidence for the subfield.

The soft spot is the link between the added text and actual temporal boundary accuracy. Captions and LLM descriptions are usually global or class-level, so it is not obvious how they sharpen frame-level or segment-boundary decisions inside the VLM without an explicit alignment step. If the method is mostly enriching the class prompt while reusing the same frozen temporal modeling as the baselines, then part of the reported improvement could be prompt engineering rather than a solution to the localization problem itself. The abstract does not spell out the exact fusion mechanism or show ablations that isolate the temporal effect, so that part needs checking in the full paper.

This is for readers already working on prompt-based or training-free video models who want to see how far language can stretch current VLMs. It is coherent on its own terms and engages the right prior work, so it deserves a serious referee even if revisions will be needed to clarify the localization mechanism.

Referee Report

2 major / 2 minor

Summary. The paper proposes TEGU, a training-free zero-shot temporal action localization (ZS-TAL) method that compensates for absent labeled video data by leveraging rich textual information from large language models and structured text extracted from video captions. This linguistic context is intended to supply finer cues for distinguishing unseen actions and improving both classification and temporal boundary localization. Experiments on THUMOS14 and ActivityNet-v1.3 are reported to show outperformance over prior training-free ZS-TAL baselines that rely on VLMs.

Significance. If the central claim holds, the work would be significant for demonstrating that pre-trained LLMs and caption-derived text can substitute for supervised training in ZS-TAL, offering a path toward more generalizable, annotation-free video localization that builds directly on frozen VLMs without task-specific fine-tuning.

major comments (2)

[Method] Method section: the description of how LLM-generated descriptions and structured caption text achieve temporal alignment or boundary refinement is insufficient. Global or class-level textual cues are not segment-aligned by default, so it is unclear what mechanism (e.g., prompt engineering, attention modulation, or auxiliary alignment) allows them to refine frame-level or boundary decisions inside the VLM without additional temporal modeling; this directly bears on whether textual guidance genuinely compensates for missing supervision in localization.
[Experiments] Experiments section (results tables): the reported gains over baselines should be accompanied by ablations that isolate the contribution of textual guidance to localization metrics (e.g., boundary precision or mAP at high IoU thresholds) versus classification accuracy alone. Without such breakdown it remains possible that improvements stem primarily from richer class prompts rather than localization-specific benefits.
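The localization metrics mentioned in the second comment rest on temporal intersection-over-union (IoU) between a predicted segment and a ground-truth segment. The sketch below shows that computation and which thresholds a single prediction would clear. The function names are hypothetical, the thresholds 0.3 to 0.7 follow the values listed in the ablation table, and a full mAP computation (ranking predictions by confidence and averaging precision per class) is deliberately left out.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def hits_per_threshold(pred, gt, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Map each IoU threshold to whether the prediction counts as a match."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


# Toy example: prediction covers 2s-8s, ground truth covers 4s-10s.
prediction = (2.0, 8.0)
ground_truth = (4.0, 10.0)

iou = temporal_iou(prediction, ground_truth)
hits = hits_per_threshold(prediction, ground_truth)
print(iou, hits)
```

Here the overlap is 4 seconds out of a union of 8, so the prediction matches at thresholds up to 0.5 and fails at 0.6 and 0.7. Gains concentrated at the higher thresholds would be the clearer sign of better boundaries rather than better class prompts.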

minor comments (2)

[Abstract] Abstract: the phrase 'structured text extracted from captions' is used without defining the extraction process or the structure imposed; a brief clarification would aid readability.
[Method] Notation: the integration of textual embeddings with VLM visual features could be denoted more explicitly (e.g., via an equation showing the fusion step) to avoid ambiguity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our paper. We address each of the major comments in detail below, providing clarifications and indicating where revisions will be made to improve the manuscript.

Referee: [Method] Method section: the description of how LLM-generated descriptions and structured caption text achieve temporal alignment or boundary refinement is insufficient. Global or class-level textual cues are not segment-aligned by default, so it is unclear what mechanism (e.g., prompt engineering, attention modulation, or auxiliary alignment) allows them to refine frame-level or boundary decisions inside the VLM without additional temporal modeling; this directly bears on whether textual guidance genuinely compensates for missing supervision in localization.

Authors: We appreciate this observation. Our approach leverages LLM-generated descriptions to create more discriminative class representations and uses structured text from video captions to provide contextual cues that aid in distinguishing action boundaries. The integration occurs through enhanced prompting of the VLM, where the textual information is concatenated with segment features to improve zero-shot classification per potential action segment. This indirectly refines localization by better identifying action presence, which is then used for boundary detection via post-processing. We acknowledge that the current description in the method section could be more explicit about this process. In the revised manuscript, we will add a dedicated subsection detailing the prompt construction and how it influences the VLM's temporal decisions without requiring additional training or modeling.
Revision: yes

Referee: [Experiments] Experiments section (results tables): the reported gains over baselines should be accompanied by ablations that isolate the contribution of textual guidance to localization metrics (e.g., boundary precision or mAP at high IoU thresholds) versus classification accuracy alone. Without such breakdown it remains possible that improvements stem primarily from richer class prompts rather than localization-specific benefits.

Authors: We agree that such ablations would strengthen the claims. Our current results demonstrate improvements in standard ZS-TAL metrics like mAP at various IoU thresholds on THUMOS14 and ActivityNet-v1.3, which inherently require accurate localization. To isolate the effects, we will include new ablation experiments in the revised version that compare performance using standard class names versus the rich textual guidance, reporting both overall mAP and specifically at high IoU (e.g., 0.5 and 0.7) to highlight localization improvements. This will help demonstrate that the textual information contributes to better boundary refinement beyond just classification.
Revision: yes
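The first response above describes boundary detection as a post-processing step applied to per-segment action-presence scores. One common and very simple form of such post-processing is to threshold the scores and merge consecutive positive frames into segments, sketched below. This is an assumption about what the step could look like, not the procedure described in the paper; `scores_to_segments`, the threshold of 0.5, and the toy scores are all hypothetical.

```python
def scores_to_segments(scores, threshold, fps=1.0):
    """Group consecutive frames scoring >= threshold into (start, end) times."""
    segments = []
    start = None
    for index, score in enumerate(scores):
        if score >= threshold and start is None:
            start = index  # a new segment opens at this frame
        elif score < threshold and start is not None:
            segments.append((start / fps, index / fps))  # segment closes
            start = None
    if start is not None:
        # The video ended while a segment was still open.
        segments.append((start / fps, len(scores) / fps))
    return segments


# Toy per-frame action-presence scores at one frame per second.
frame_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.6]
segments = scores_to_segments(frame_scores, threshold=0.5)
print(segments)
```

Under this kind of scheme, textual guidance can only move boundaries by changing the per-frame scores near action onsets and offsets, which is why the referee's request for high-IoU ablations is a direct test of the claim.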

Circularity Check

0 steps flagged

No circularity: method relies on external pre-trained VLMs/LLMs and caption text rather than self-referential fits or derivations

The paper introduces TEGU as a training-free ZS-TAL approach that augments frozen VLMs with rich textual cues from LLMs and structured captions to improve fine-grained discrimination and localization. No equations, parameters, or predictions are shown to reduce by construction to the paper's own inputs or prior self-citations. All core components (VLM backbone, LLM-generated descriptions, caption extraction) are external and independently pre-trained, with performance evaluated on standard benchmarks (THUMOS14, ActivityNet-v1.3) against other non-training baselines. This satisfies the self-contained criterion with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach depends on the capabilities of existing pre-trained VLMs and LLMs but introduces no new mathematical axioms or free parameters in the abstract description.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation to the paper passage: unclear

we propose a novel approach, “Textual Guidance for finer localization of actions in videos” (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions... scene triplets... affine triplets and distractor triplets... max-margin ranking loss

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Alayrac, J

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Ruther- ford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. a. Bi´nkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual lang...

work page 2022

[2] [2]

S. Buch, V . Escorcia, B. Ghanem, L. Fei-Fei, and J. C. Niebles. End- to-end, single-stream temporal action detection in untrimmed videos. InBMVC, 2017

work page 2017

[3] [4]

Y .-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. Rethinking the faster r-cnn architecture for temporal action localization. InCVPR, 2018

work page 2018

[4] [5]

Grill, F

J.-B. Grill, F. Strub, F. Altch ´e, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Ghesh- laghi Azar, B. Piot, k. kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent - a new approach to self-supervised learning. InNeurIPS, 2020

work page 2020

[5] [6]

Gupta, A

A. Gupta, A. Arora, S. Narayan, S. Khan, F. S. Khan, and G. W. Tay- lor. Open-vocabulary temporal action localization using multimodal guidance. InBMVC, 2024

work page 2024

[6] [7]

B. He, X. Yang, L. Kang, Z. Cheng, X. Zhou, and A. Shrivastava. Asm-loc: Action-aware segment modeling for weakly-supervised tem- poral action localization. InCVPR, 2022

work page 2022

[7] [8]

F. C. Heilbron, V . Escorcia, B. Ghanem, and J. C. Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015

work page 2015

[8] [9]

J. Hyun, S. H. Han, H. Kang, J.-Y . Lee, and S. J. Kim. Exploring scalability of self-training for open-vocabulary temporal action local- ization. InWACV, 2025

work page 2025

[9] [10]

in the wild

H. Idrees, A. R. Zamir, Y .-G. Jiang, A. Gorban, I. Laptev, R. Suk- thankar, and M. Shah. The thumos challenge on action recognition for videos “in the wild”.CVIU, 2017

work page 2017

[10] [11]

C. Ju, T. Han, K. Zheng, Y . Zhang, and W. Xie. Prompting visual- language models for efficient video understanding. InECCV, 2021

work page 2021

[11] [12]

C. Ju, K. Zheng, J. Liu, P. Zhao, Y. Zhang, J. Chang, Q. Tian, and Y. Wang. Distilling vision-language pre-training to collaborate with weakly-supervised temporal action localization. In CVPR, 2023

[12] [13]

C. Li, J. Chibane, Y. He, N. Pearl, A. Geiger, and G. Pons-Moll. Unimotion: Unifying 3d human motion synthesis and understanding. arXiv, 2024

[13] [14]

J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

[14] [15]

Z. Li, Y. Chai, T. Y. Zhuo, L. Qu, G. Haffari, F. Li, D. Ji, and Q. H. Tran. FACTUAL: A benchmark for faithful and consistent textual scene graph parsing. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6377–6390, Toronto, Canada, July 2023. Association for Computational Linguistics

[15] [16]

Z. Li, Y. Zhong, R. Song, T. Li, L. Ma, and W. Zhang. Detal: Open-vocabulary temporal action localization with decoupled networks. T-PAMI, 2024

[16] [17]

Z. Li, Y. Zhong, R. Song, T. Li, L. Ma, and W. Zhang. Detal: Open-vocabulary temporal action localization with decoupled networks. IEEE TPAMI, 2024

[17] [18]

B. Liberatori, A. Conti, P. Rota, Y. Wang, and E. Ricci. Test-time zero-shot temporal action localization. In CVPR, 2024

[18] [19]

C. Lin, C. Xu, D. Luo, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and Y. Fu. Learning salient boundary feature for anchor-free temporal action localization. In CVPR, 2021

[19] [20]

T. Lin, X. Zhao, and Z. Shou. Single shot temporal action detection. In ACMMM, 2017

[20] [21]

W. Lin, M. J. Mirza, M. Kozinski, H. Possegger, H. Kuehne, and H. Bischof. Video test-time adaptation for action recognition. In CVPR, 2023

[21] [22]

M. Liu, L. Wang, S. Zhou, K. Xia, Q. Wu, Q. Zhang, and G. Hua. Stepwise multi-grained boundary detector for point-supervised temporal action localization. In ECCV, 2024

[22] [23]

S. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 1982

[23] [24]

X. Ma, J. Zhang, S. Guo, and W. Xu. Swapprompt: Test-time prompt adaptation for vision-language models. In NeurIPS, 2023

[24] [25]

S. Manli, N. Weili, H. De-An, Y. Zhiding, G. Tom, A. Anima, and X. Chaowei. Test-time prompt tuning for zero-shot generalization in vision-language models. In NeurIPS, 2022

[25] [26]

L. Momeni, M. Caron, A. Nagrani, A. Zisserman, and C. Schmid. Verbs in action: Improving verb understanding in video-language models. In ICCV, 2023

[26] [27]

S. Nag, X. Zhu, Y.-Z. Song, and T. Xiang. Zero-shot temporal action detection via vision-language prompting. In ECCV, 2022

[27] [28]

Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv, 2023

[28] [29]

T. Phan, K. Vo, D. Le, G. Doretto, D. Adjeroh, and N. Le. Zeetad: Adapting pretrained vision-language model for zero-shot end-to-end temporal action detection. In WACV, 2024

[29] [30]

Z. Qing, H. Su, W. Gan, D. Wang, W. Wu, X. Wang, Y. Qiao, J. Yan, C. Gao, and N. Sang. Temporal context aggregation network for temporal action proposal refinement. In CVPR, 2021

[30] [31]

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

[31] [32]

N. Reimers. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv, 2019

[32] [33]

J. H. A. Samadh, H. Gani, N. H. Hussein, M. U. Khattak, M. Naseer, F. Khan, and S. Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. In NeurIPS, 2023

[33] [34]

Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, 2017

[34] [35]

C. Tao, G. Kwon, V. Gunjal, H. Yang, Z. Cai, Y. Dukler, A. Swaminathan, R. Manmatha, C. J. Taylor, and S. Soatto. Navero: Unlocking fine-grained semantics for video-language compositionality. arXiv, 2024

[35] [36]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv, 2023

[36] [37]

D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021

[37] [38]

L. Wang, Y. Xiong, D. Lin, and L. Van Gool. Untrimmednets for weakly supervised action recognition and detection. In CVPR, 2017

[38] [39]

Z. Wang, A. Blume, S. Li, G. Liu, J. Cho, Z. Tang, M. Bansal, and H. Ji. Paxion: Patching action knowledge in video-language foundation models. In NeurIPS, 2023

[39] [40]

B. Xiong, X. Yang, Y. Song, Y. Wang, and C. Xu. Modality-collaborative test-time adaptation for action recognition. In CVPR, 2024

[40] [41]

H. Xu, A. Das, and K. Saenko. R-c3d: Region convolutional 3d network for temporal activity detection. In ICCV, 2017

[41] [42]

H. Xu, G. Ghosh, P.-Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In EMNLP, 2021

[42] [43]

S. Yan, X. Xiong, A. Nagrani, A. Arnab, Z. Wang, W. Ge, D. Ross, and C. Schmid. Unloc: A unified framework for video localization tasks. In ICCV, 2023

[43] [44]

L. Yang, H. Peng, D. Zhang, J. Fu, and J. Han. Revisiting anchor mechanisms for temporal action localization. IEEE TIP, 2020

[44] [45]

W. Yang, T. Zhang, X. Yu, T. Qi, Y. Zhang, and F. Wu. Uncertainty guided collaborative training for weakly supervised temporal action detection. In CVPR, 2021

[45] [46]

C. Yi, S. Yang, Y. Wang, H. Li, Y. peng Tan, and A. Kot. Temporal coherent test time optimization for robust video classification. In ICLR, 2023

[46] [47]

J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu. Coca: Contrastive captioners are image-text foundation models. arXiv, 2022

[47] [48]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023

[48] [49]

C. Zhang, J. Wu, and Y. Li. Actionformer: Localizing moments of actions with transformers. In ECCV, 2022

[49] [50]

M. Zhang, S. Levine, and C. Finn. Memo: Test time robustness via adaptation and augmentation. In NeurIPS, 2022.

In this Supplementary Material, we provide additional quantitative and qualitative results. In Sec. A, we report additional results; in Sec. B, we extend the analysis presented in the main paper regarding captions and scene triplets. Following th...
