
arxiv: 2605.22201 · v1 · pith:K7XGSAC3new · submitted 2026-05-21 · 💻 cs.CV

Zero-Shot Temporal Action Localization Through Textual Guidance

Pith reviewed 2026-05-22 07:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot temporal action localization · textual guidance · vision-language models · untrimmed videos · fine-grained action discrimination · training-free localization · THUMOS14 · ActivityNet

The pith

Rich textual information from language models enables training-free localization of unseen actions in videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that detailed textual cues can replace training data for identifying and timing actions never seen before in zero-shot temporal action localization. It incorporates descriptions generated by large language models together with structured details pulled from captions to supply extra context that helps separate similar actions. Existing approaches either depend on vision-language models that miss fine distinctions or require large labeled video collections that restrict generalization. If this holds, video systems could adapt to novel actions using only language resources instead of new annotated footage.

Core claim

We propose TEGU, a novel approach for zero-shot temporal action localization that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context improves fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets, where it outperforms state-of-the-art ZS-TAL approaches that do not involve training.

What carries the argument

Textual guidance that combines large language model outputs with structured text from captions to supply richer cues for distinguishing actions.

If this is right

  • Actions never encountered in training can still be localized in untrimmed videos.
  • Large annotated video datasets are no longer required for competitive zero-shot performance.
  • Higher accuracy is achieved than other training-free ZS-TAL methods on THUMOS14 and ActivityNet-v1.3.
  • Linguistic context helps separate fine-grained actions that vision models alone struggle to tell apart.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same textual enrichment could be tested on zero-shot video retrieval or segmentation tasks.
  • Systems for domains with changing action sets, such as monitoring or sports, could update without new video labels.
  • Pairing the guidance with light visual adaptation on a few examples might further raise accuracy in practice.

Load-bearing premise

Rich textual information from large language models and captions can provide sufficient cues to discriminate fine-grained actions without any training on video examples.

What would settle it

A direct comparison on videos of closely related actions where removing the textual components leaves performance unchanged or improved would show the added text does not deliver the claimed discrimination benefit.

Figures

Figures reproduced from arXiv: 2605.22201 by Alessandro Conti, Benedetta Liberatori, Elisa Ricci, Lorenzo Vaquero, Paolo Rota, Yiming Wang.

Figure 1: Textual guidance for action localization. We propose to use automatically extracted textual cues as an alternative to supervisory signal to adapt a pre-trained vision and language model to the task of temporal action localization. After generating class-level and video-level textual cues, we demonstrate the effectiveness of exploiting their combination to tune a model at test time.
Figure 2: TEGU steps 1-2. Given an input video, TEGU first encodes the class names and its frames into class features and its average video features to get the class predictions at video level. Then, it generates the predicted class' descriptions and objects. Concurrently, it generates frame captions and parses scene graphs, which are clustered to remove redundancy and get video triplets.
Figure 3
Figure 4: Impact of varying the number S of scene triplets. Results are collected on THUMOS14 (50%-50%) and on THUMOS14 (75%-25%).
TABLE III: Ablation on the use of descriptions D and triplets T. Results are collected on THUMOS14 (50%-50%) and on THUMOS14 (75%-25%). TEGU final configuration is highlighted. Columns: Setting, D, T, mAP (%) ↑ at IoU 0.3, 0.4, 0.5, 0.6, 0.7, Avg. Setting 50%-50%: 16.9 10.7 6.1 3.2 1.6 7.7; ✓ 17.3 11.7 6.9 3.6 1.8 8.3 …
Figure 6: Text-to-text similarities. Cosine similarities between frame-level captions and ground truth video classes, computed on SentenceBERT embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class.
Figure 5: Image-to-text similarities. Cosine similarities between frames and ground truth video classes, computed on VLM embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class.
Figure 7: Text-to-text similarities in non-foreground frames. Cosine similarities between ground truth video classes and (i) frame-level captions grouped by background and transition and (ii) scene triplets, again grouped by background and transition. Values are averaged per-class on ActivityNet-v1.3.
Figure 8: Text-to-text similarities in non-foreground frames. Cosine similarities between ground truth video classes and (i) frame-level captions grouped by background and transition and (ii) scene triplets, again grouped by background and transition. Values are averaged per-class.
Figure 9: Text-to-text similarities on THUMOS14. Cosine similarities between frame-level captions and ground truth video classes, computed on SentenceBERT embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class. We report the average using different captioning models, together with standard deviations.
Figure 10: Image-to-text similarities on ActivityNet-v1.3. Cosine similarities between frames and ground truth video classes, computed on VLM embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class.
Figure 11: Text-to-text similarities on ActivityNet-v1.3. Cosine similarities between frame-level captions and ground truth video classes, computed on SentenceBERT embeddings. Numbers are calculated for foreground, background, and transition frames, and averaged across each class.
Figure 12: Text-to-text similarities in non-foreground frames. Cosine similarities between ground truth video classes and (i) frame-level captions grouped by background and transition and (ii) scene triplets, again grouped by background and transition. Values are averaged per-class.
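
The Figure 2 caption describes a video-level prediction step: class names and frames are encoded, frame features are averaged, and the averaged feature is compared with the class features. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes cosine similarity as the comparison and uses made-up three-dimensional toy vectors in place of real VLM embeddings; the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_features):
    """Average per-frame feature vectors into a single video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def predict_video_class(frame_features, class_features):
    """Return the best-matching class name and all class scores.

    frame_features: list of per-frame vectors (assumed to come from a VLM image encoder).
    class_features: dict mapping class name -> vector (assumed to come from a VLM text encoder).
    """
    video_feature = mean_pool(frame_features)
    scores = {name: cosine(video_feature, feat) for name, feat in class_features.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for embeddings; real features would have hundreds of dimensions.
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.0, 0.2]]
classes = {"GolfSwing": [1.0, 0.0, 0.0], "TennisSwing": [0.0, 1.0, 0.0]}
predicted, scores = predict_video_class(frames, classes)
print(predicted, scores)
```

In this sketch the predicted class would then select which descriptions and objects to generate, as the caption outlines; that generation step is not modeled here.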

Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, “Textual Guidance for finer localization of actions in videos” (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TEGU, a training-free zero-shot temporal action localization (ZS-TAL) method that compensates for absent labeled video data by leveraging rich textual information from large language models and structured text extracted from video captions. This linguistic context is intended to supply finer cues for distinguishing unseen actions and improving both classification and temporal boundary localization. Experiments on THUMOS14 and ActivityNet-v1.3 are reported to show outperformance over prior training-free ZS-TAL baselines that rely on VLMs.

Significance. If the central claim holds, the work would be significant for demonstrating that pre-trained LLMs and caption-derived text can substitute for supervised training in ZS-TAL, offering a path toward more generalizable, annotation-free video localization that builds directly on frozen VLMs without task-specific fine-tuning.

major comments (2)
  1. [Method] Method section: the description of how LLM-generated descriptions and structured caption text achieve temporal alignment or boundary refinement is insufficient. Global or class-level textual cues are not segment-aligned by default, so it is unclear what mechanism (e.g., prompt engineering, attention modulation, or auxiliary alignment) allows them to refine frame-level or boundary decisions inside the VLM without additional temporal modeling; this directly bears on whether textual guidance genuinely compensates for missing supervision in localization.
  2. [Experiments] Experiments section (results tables): the reported gains over baselines should be accompanied by ablations that isolate the contribution of textual guidance to localization metrics (e.g., boundary precision or mAP at high IoU thresholds) versus classification accuracy alone. Without such breakdown it remains possible that improvements stem primarily from richer class prompts rather than localization-specific benefits.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'structured text extracted from captions' is used without defining the extraction process or the structure imposed; a brief clarification would aid readability.
  2. [Method] Notation: the integration of textual embeddings with VLM visual features could be denoted more explicitly (e.g., via an equation showing the fusion step) to avoid ambiguity for readers.
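
The referee's second major comment turns on mAP at different IoU thresholds. As a reference point, here is a small sketch of temporal IoU between a predicted and a ground-truth segment, checked against the thresholds 0.3 to 0.7 that appear in the Table III header. The function names and the example segments are illustrative assumptions; the full mAP computation (ranking predictions by confidence and averaging precision per class) is not shown.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def matches_at_thresholds(pred, gt, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """For each IoU threshold, report whether the prediction counts as a hit."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


# Hypothetical prediction and ground truth: overlap is 4 s, union is 8 s, so IoU = 0.5.
result = matches_at_thresholds((2.0, 8.0), (4.0, 10.0))
print(result)
```

A prediction that is correctly classified but loosely bounded passes the low thresholds and fails the high ones, which is why the referee asks for a breakdown at high IoU.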

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our paper. We address each of the major comments in detail below, providing clarifications and indicating where revisions will be made to improve the manuscript.

  1. Referee: [Method] Method section: the description of how LLM-generated descriptions and structured caption text achieve temporal alignment or boundary refinement is insufficient. Global or class-level textual cues are not segment-aligned by default, so it is unclear what mechanism (e.g., prompt engineering, attention modulation, or auxiliary alignment) allows them to refine frame-level or boundary decisions inside the VLM without additional temporal modeling; this directly bears on whether textual guidance genuinely compensates for missing supervision in localization.

    Authors: We appreciate this observation. Our approach leverages LLM-generated descriptions to create more discriminative class representations and uses structured text from video captions to provide contextual cues that aid in distinguishing action boundaries. The integration occurs through enhanced prompting of the VLM, where the textual information is concatenated with segment features to improve zero-shot classification per potential action segment. This indirectly refines localization by better identifying action presence, which is then used for boundary detection via post-processing. We acknowledge that the current description in the method section could be more explicit about this process. In the revised manuscript, we will add a dedicated subsection detailing the prompt construction and how it influences the VLM's temporal decisions without requiring additional training or modeling. revision: yes

  2. Referee: [Experiments] Experiments section (results tables): the reported gains over baselines should be accompanied by ablations that isolate the contribution of textual guidance to localization metrics (e.g., boundary precision or mAP at high IoU thresholds) versus classification accuracy alone. Without such breakdown it remains possible that improvements stem primarily from richer class prompts rather than localization-specific benefits.

    Authors: We agree that such ablations would strengthen the claims. Our current results demonstrate improvements in standard ZS-TAL metrics like mAP at various IoU thresholds on THUMOS14 and ActivityNet-v1.3, which inherently require accurate localization. To isolate the effects, we will include new ablation experiments in the revised version that compare performance using standard class names versus the rich textual guidance, reporting both overall mAP and specifically at high IoU (e.g., 0.5 and 0.7) to highlight localization improvements. This will help demonstrate that the textual information contributes to better boundary refinement beyond just classification. revision: yes
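
The first simulated response says that identifying action presence is "then used for boundary detection via post-processing" without specifying the procedure. One common form such post-processing can take is thresholding per-frame scores and merging consecutive positive frames into segments. The sketch below illustrates that generic technique under stated assumptions; the threshold, the frame rate, and the scores are made up, and nothing here is confirmed to match TEGU.

```python
def scores_to_segments(frame_scores, threshold, fps=1.0):
    """Group consecutive frames scoring at or above `threshold` into (start, end) segments.

    Times are in seconds, computed as frame_index / fps. The end of a segment is the
    time of the first frame after the run, so a run of frames 2..4 at 1 fps gives (2.0, 5.0).
    """
    segments = []
    start = None
    for index, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = index  # a new run of action frames begins
        elif score < threshold and start is not None:
            segments.append((start / fps, index / fps))  # the run ended at the previous frame
            start = None
    if start is not None:  # the video ends while an action is still active
        segments.append((start / fps, len(frame_scores) / fps))
    return segments


# Hypothetical per-frame action scores for a nine-frame clip.
segments = scores_to_segments([0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.6], threshold=0.5)
print(segments)
```

Under this kind of scheme, better frame-level discrimination translates directly into tighter boundaries, which is the indirect effect the response describes.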

Circularity Check

0 steps flagged

No circularity: method relies on external pre-trained VLMs/LLMs and caption text rather than self-referential fits or derivations


The paper introduces TEGU as a training-free ZS-TAL approach that augments frozen VLMs with rich textual cues from LLMs and structured captions to improve fine-grained discrimination and localization. No equations, parameters, or predictions are shown to reduce by construction to the paper's own inputs or prior self-citations. All core components (VLM backbone, LLM-generated descriptions, caption extraction) are external and independently pre-trained, with performance evaluated on standard benchmarks (THUMOS14, ActivityNet-v1.3) against other non-training baselines. This satisfies the self-contained criterion with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach depends on the capabilities of existing pre-trained VLMs and LLMs but introduces no new mathematical axioms or free parameters in the abstract description.

pith-pipeline@v0.9.0 · 5779 in / 972 out tokens · 24571 ms · 2026-05-22T07:50:07.316406+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    we propose a novel approach, “Textual Guidance for finer localization of actions in videos” (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions... scene triplets... affine triplets and distractor triplets... max-margin ranking loss
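
The quoted passage mentions affine triplets, distractor triplets, and a max-margin ranking loss, but the excerpt does not give the formula. The sketch below shows the generic hinge form of such a loss, max(0, margin - s_pos + s_neg), averaged over all positive and negative pairs. Treating affine triplets as positives and distractor triplets as negatives is an assumption made for illustration, as are the margin value and the similarity scores.

```python
def max_margin_ranking_loss(pos_scores, neg_scores, margin=0.2):
    """Generic max-margin ranking loss over all (positive, negative) score pairs.

    Each pair contributes max(0, margin - pos + neg): zero once the positive
    outscores the negative by at least `margin`, positive otherwise.
    """
    terms = [max(0.0, margin - pos + neg) for pos in pos_scores for neg in neg_scores]
    return sum(terms) / len(terms) if terms else 0.0


# Hypothetical similarities between a frame feature and two kinds of scene triplets.
affine_scores = [0.9, 0.6]      # assumed positives: triplets related to the action
distractor_scores = [0.3, 0.5]  # assumed negatives: triplets unrelated to the action
loss = max_margin_ranking_loss(affine_scores, distractor_scores, margin=0.2)
print(loss)
```

Only the pair (0.6, 0.5) violates the margin in this toy example, so the loss is small but non-zero; minimizing it would push action-related text above distractors by at least the margin.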

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
