pith. sign in

arxiv: 2606.21358 · v1 · pith:OD62UMB6new · submitted 2026-06-19 · 💻 cs.CV

LEViL: Label-Efficient Video Learning via Zero-Shot Distillation over VLM-Generated Pseudo-Label Spaces

Pith reviewed 2026-06-26 14:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords label-efficient video learningzero-shot distillationvision-language modelspseudo-label spacesvideo action recognitionsemi-supervised learningtransfer learningannotation-free pretraining
0
0 comments X

The pith

A framework pretrains video encoders without any manual labels by distilling zero-shot targets from vision-language model descriptions of unlabeled videos, then adapts via target-label-aware fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-stage method for video action recognition under limited labels. In the first stage, a vision-language model turns unlabeled videos into textual descriptions that form a semantic pseudo-label space; a frozen vision-language model then supplies soft target distributions over this space so a student video encoder can learn without source annotations. In the second stage, the encoder is fine-tuned on the target dataset by combining ordinary supervised loss on the available labels with zero-shot distillation that respects the actual target label set. Experiments on UCF101 and HMDB51 show this approach beats existing semi-supervised baselines in every low-label regime tested and yields useful initializations even when the unlabeled pretraining pool is modest.

Core claim

The LEViL framework performs annotation-free pretraining by having a vision-language model generate textual descriptions of unlabeled videos, constructing an interpretable semantic pseudo-label space from those descriptions, and training a student video encoder via zero-shot soft targets produced by a frozen video-language model over the same space; downstream adaptation then mixes supervised learning on labeled target videos with zero-shot distillation over the true target label set, producing representations that outperform semi-supervised video action recognition baselines on UCF101 and HMDB51 across all tested limited-label settings and serve as effective initializations for full-data fi

What carries the argument

Zero-shot distillation over a VLM-generated semantic pseudo-label space, which supplies soft training targets during annotation-free pretraining and is reused in target-label-set-aware fine-tuning.

If this is right

  • Reliance on large manually labeled source datasets for video pretraining can be removed.
  • Transferable video representations can be obtained from a comparatively small pool of unlabeled videos.
  • Performance gains appear consistently across all tested limited-label regimes on UCF101 and HMDB51.
  • The pretrained encoder supplies a stronger starting point for subsequent full-data fine-tuning than random initialization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pseudo-label construction step might be applied to other video tasks such as temporal action localization if the generated descriptions capture timing information.
  • Domain shift between the unlabeled pretraining videos and the target dataset could be further reduced by filtering the pseudo-label space to emphasize categories known to appear in the target label set.
  • Scaling the unlabeled pretraining pool while keeping the VLM fixed would test whether the current gains are limited by data volume or by the quality of the pseudo-label space itself.

Load-bearing premise

Textual descriptions produced by a vision-language model for unlabeled videos can be turned into a semantic pseudo-label space that yields useful zero-shot soft targets for training a video encoder.

What would settle it

Run the pretraining stage on a collection of videos whose VLM descriptions produce a pseudo-label space with no semantic alignment to action categories, then measure whether downstream accuracy still exceeds standard semi-supervised baselines on UCF101 or HMDB51.

Figures

Figures reproduced from arXiv: 2606.21358 by Asl{\i} \c{C}elik.

Figure 1
Figure 1. Figure 1: Word-cloud visualization of the constructed pseudo-label space. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the pseudo-label space construction pipeline. A vision-language model first generates open-vocabulary descriptions of unlabeled video [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of zero-shot distillation over the caption-derived pseudo-label space. A frozen video-language model compares each unlabeled video with [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
read the original abstract

Supervised video pretraining is a common transfer learning practice for improving downstream action recognition performance. However, it requires large-scale labeled source datasets, and the effectiveness of the learned initialization is influenced by the similarity between the source and target domains. Constructing such labeled pretraining datasets for different target domains is costly and difficult to scale. To address these limitations, this study proposes a label-efficient video learning framework that combines annotation-free video pretraining with target-label-set-aware fine-tuning. During pretraining, a vision-language model (VLM) generates textual descriptions of unlabeled videos, which are processed to construct an interpretable semantic pseudo-label space. A frozen video-language model then produces zero-shot soft target distributions over this space, allowing a student video encoder to learn semantically rich representations without manual source annotations. During downstream adaptation, target-label-set-aware fine-tuning combines supervised learning from labeled target videos with zero-shot distillation over the actual target label set, helping preserve VLM-derived semantic guidance while adapting the pretrained encoder to the target task. Experiments on UCF101 and HMDB51 show that the proposed framework outperforms the compared semi-supervised video action recognition methods across all evaluated limited-label regimes. Moreover, the annotation-free pretraining stage learns transferable representations that provide an effective initialization for full-data fine-tuning, despite relying on a comparatively modest unlabeled pretraining pool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes LEViL, a label-efficient video learning framework combining annotation-free pretraining with target-label-set-aware fine-tuning. In pretraining, a VLM generates textual descriptions of unlabeled videos to build an interpretable semantic pseudo-label space; a frozen video-language model then supplies zero-shot soft targets for distilling a student video encoder. Downstream, supervised learning on target labels is combined with zero-shot distillation over the actual target label set. Experiments on UCF101 and HMDB51 are reported to outperform compared semi-supervised video action recognition methods across limited-label regimes, with the pretraining stage also providing effective initialization for full-data fine-tuning despite a modest unlabeled pool.

Significance. If the empirical claims hold, the work demonstrates a practical route to domain-adaptable video pretraining that avoids large labeled source datasets and mitigates domain-shift issues by leveraging VLM-derived semantic structure. The explicit use of a modest unlabeled pretraining pool and the preservation of VLM guidance during fine-tuning are pragmatic strengths that could influence label-efficient pipelines in action recognition and related video tasks.

minor comments (2)
  1. Abstract: the claim of outperformance would be strengthened by naming the specific semi-supervised baselines and reporting at least one quantitative result (e.g., accuracy delta at a given label fraction) rather than a qualitative statement.
  2. The construction of the pseudo-label space from VLM descriptions is described at a high level; a short illustrative example or pseudocode in §3 would improve reproducibility without altering the central pipeline.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of LEViL, the recognition of its pragmatic strengths in avoiding large labeled source datasets, and the recommendation for minor revision. We appreciate the constructive framing of the work's potential influence on label-efficient video pipelines.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a pipeline that ingests external VLM outputs to build a pseudo-label space and then performs zero-shot distillation; no equations, fitted parameters, or self-citations are presented that reduce any claimed prediction or result to the method's own inputs by construction. The abstract and described framework contain no self-definitional steps, no renaming of known results, and no load-bearing uniqueness theorems imported from the authors' prior work. The central claim therefore remains independent of internal circular fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5778 in / 1255 out tokens · 32710 ms · 2026-06-26T14:37:19.962331+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 6 linked inside Pith

  1. [1]

    The kinetics human action video dataset,

    W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijaya- narasimhan, F. Viola, T. Green, T. Back, P. Natsevet al., “The kinetics human action video dataset,”arXiv preprint arXiv:1705.06950, 2017

  2. [2]

    Large-scale video classification with convolutional neural networks,

    A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” inProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732

  3. [3]

    Transfer learning for video action recognition: A comparative overview of weight initialization strategies: A. c ¸elik,

    A. C ¸ elik, “Transfer learning for video action recognition: A comparative overview of weight initialization strategies: A. c ¸elik,”Signal, Image and Video Processing, vol. 19, no. 15, p. 1306, 2025

  4. [4]

    Distilling the knowledge in a neural network,

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”arXiv preprint arXiv:1503.02531, 2015

  5. [5]

    Danet: Semi-supervised differentiated auxiliaries guided network for video action recognition,

    G. Gao, Z. Liu, G. Zhang, J. Li, and A. K. Qin, “Danet: Semi-supervised differentiated auxiliaries guided network for video action recognition,” Neural Networks, vol. 158, pp. 121–131, 2023

  6. [6]

    Distinit: Learning video representations without a single labeled video,

    R. Girdhar, D. Tran, L. Torresani, and D. Ramanan, “Distinit: Learning video representations without a single labeled video,” inProceedings of the IEEE/cvf international conference on computer vision, 2019, pp. 852–861

  7. [7]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

  8. [8]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 4904–4916

  9. [9]

    Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,

    M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdul- mohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa et al., “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,”arXiv preprint arXiv:2502.14786, 2025

  10. [10]

    Frozen in time: A joint video and image encoder for end-to-end retrieval,

    M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1728–1738

  11. [11]

    A clip-hitchhiker’s guide to long video retrieval,

    ——, “A clip-hitchhiker’s guide to long video retrieval,”arXiv preprint arXiv:2205.08508, 2022

  12. [12]

    Fine- tuned clip models are efficient video learners,

    H. Rasheed, M. U. Khattak, M. Maaz, S. Khan, and F. S. Khan, “Fine- tuned clip models are efficient video learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6545–6554

  13. [13]

    Vl-jepa: Joint em- bedding predictive architecture for vision-language,

    D. Chen, M. Shukor, T. Moutakanni, W. Chung, J. Yu, T. Kasarla, Y . Bang, A. Bolourchi, Y . LeCun, and P. Fung, “Vl-jepa: Joint em- bedding predictive architecture for vision-language,”arXiv preprint arXiv:2512.10942, 2025

  14. [14]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynoldset al., “Flamingo: a visual language model for few-shot learning,”Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022

  15. [15]

    Qwen-vl: A frontier large vision-language model with versatile abilities,

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,”arXiv preprint arXiv:2308.12966, vol. 1, no. 2, p. 3, 2023

  16. [16]

    Grounding multimodal large language models to the world,

    Z. Peng, W. Wang, L. Dong, Y . Hao, S. Huang, S. Ma, Q. Ye, and F. Wei, “Grounding multimodal large language models to the world,” in International Conference on Learning Representations, vol. 2024, 2024, pp. 51 575–51 598

  17. [17]

    Instructblip: Towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,”Advances in neural information processing systems, vol. 36, pp. 49 250–49 267, 2023

  18. [18]

    Internvideo2: Scaling foundation models for multimodal video understanding,

    Y . Wang, K. Li, X. Li, J. Yu, Y . He, G. Chen, B. Pei, R. Zheng, Z. Wang, Y . Shiet al., “Internvideo2: Scaling foundation models for multimodal video understanding,” inEuropean conference on computer vision. Springer, 2024, pp. 396–416

  19. [19]

    Videochat: Chat-centric video understanding,

    K. Li, Y . He, Y . Wang, Y . Li, W. Wang, P. Luo, Y . Wang, L. Wang, and Y . Qiao, “Videochat: Chat-centric video understanding,”Science China Information Sciences, vol. 68, no. 10, p. 200102, 2025

  20. [20]

    Video-chatgpt: Towards detailed video understanding via large vision and language models,

    M. Maaz, H. Rasheed, S. Khan, and F. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 12 585– 12 602

  21. [21]

    Video-llama: An instruction-tuned audio- visual language model for video understanding,

    H. Zhang, X. Li, and L. Bing, “Video-llama: An instruction-tuned audio- visual language model for video understanding,” inProceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, 2023, pp. 543–553

  22. [22]

    Speednet: Learning the speediness in videos,

    S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Ru- binstein, M. Irani, and T. Dekel, “Speednet: Learning the speediness in videos,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9922–9931

  23. [23]

    Self-supervised spatiotem- poral feature learning via video rotation prediction,

    L. Jing, X. Yang, J. Liu, and Y . Tian, “Self-supervised spatiotem- poral feature learning via video rotation prediction,”arXiv preprint arXiv:1811.11387, 2018

  24. [24]

    Self-supervised video representation learning with space-time cubic puzzles,

    D. Kim, D. Cho, and I. S. Kweon, “Self-supervised video representation learning with space-time cubic puzzles,” inProceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8545– 8552

  25. [25]

    Self- supervised spatiotemporal learning via video clip order prediction,

    D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y . Zhuang, “Self- supervised spatiotemporal learning via video clip order prediction,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 334–10 343

  26. [26]

    Videomoco: Contrastive video representation learning with temporally adversarial examples,

    T. Pan, Y . Song, T. Yang, W. Jiang, and W. Liu, “Videomoco: Contrastive video representation learning with temporally adversarial examples,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 205–11 214

  27. [27]

    Tclr: Temporal con- trastive learning for video representation,

    I. Dave, R. Gupta, M. N. Rizve, and M. Shah, “Tclr: Temporal con- trastive learning for video representation,”Computer Vision and Image Understanding, vol. 219, p. 103406, 2022

  28. [28]

    Videomae: Masked autoen- coders are data-efficient learners for self-supervised video pre-training,

    Z. Tong, Y . Song, J. Wang, and L. Wang, “Videomae: Masked autoen- coders are data-efficient learners for self-supervised video pre-training,” Advances in neural information processing systems, vol. 35, pp. 10 078– 10 093, 2022

  29. [29]

    Motionmae: Self-supervised video representation learning with motion-aware masked autoencoders

    H. Yang, D. Huang, B. Wen, J. Wu, H. Yao, Y . Jiang, X. Zhu, and Z. Yuan, “Motionmae: Self-supervised video representation learning with motion-aware masked autoencoders.” inBMVC, 2024

  30. [30]

    Videossl: Semi- supervised learning for video classification,

    L. Jing, T. Parag, Z. Wu, Y . Tian, and H. Wang, “Videossl: Semi- supervised learning for video classification,” inProceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 1110–1119

  31. [31]

    Feature distillation from vision-language model for semisupervised action classification,

    A. Celik, A. K ¨uc ¸¨ukmanisa, and O. Urhan, “Feature distillation from vision-language model for semisupervised action classification,”Turkish Journal of Electrical Engineering and Computer Sciences, vol. 31, no. 6, pp. 1129–1145, 2023

  32. [32]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Luet al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 24 185–24 198

  33. [33]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites,

    Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Maet al., “How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites,”Science China Information Sciences, vol. 67, no. 12, p. 220101, 2024

  34. [34]

    spaCy: Industrial-strength natural language processing in python,

    M. Honnibal, I. Montani, S. V . Landeghem, and A. Boyd, “spaCy: Industrial-strength natural language processing in python,” 2020

  35. [35]

    ChatGPT,

    OpenAI, “ChatGPT,” [Online]. Available: https://chatgpt.com/, 2026, accessed: Jun. 18, 2026

  36. [36]

    Cross-model pseudo-labeling for semi-supervised action recognition,

    Y . Xu, F. Wei, X. Sun, C. Yang, Y . Shen, B. Dai, B. Zhou, and S. Lin, “Cross-model pseudo-labeling for semi-supervised action recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2959–2968

  37. [37]

    Learning from temporal gradient for semi-supervised action recognition,

    J. Xiao, L. Jing, L. Zhang, J. He, Q. She, Z. Zhou, A. Yuille, and Y . Li, “Learning from temporal gradient for semi-supervised action recognition,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3252–3262

  38. [38]

    Multiview pseudo-labeling for semi-supervised learning from video,

    B. Xiong, H. Fan, K. Grauman, and C. Feichtenhofer, “Multiview pseudo-labeling for semi-supervised learning from video,” inProceed- ings of the IEEE/CVF international conference on computer vision, 2021, pp. 7209–7219

  39. [39]

    Learn2augment: learning to composite videos for data augmentation in action recognition,

    S. N. Gowda, M. Rohrbach, F. Keller, and L. Sevilla-Lara, “Learn2augment: learning to composite videos for data augmentation in action recognition,” inEuropean conference on computer vision. Springer, 2022, pp. 242–259

  40. [40]

    Learning representational invariances for data-efficient action recognition,

    Y . Zou, J. Choi, Q. Wang, and J.-B. Huang, “Learning representational invariances for data-efficient action recognition,”Computer Vision and Image Understanding, vol. 227, p. 103597, 2023

  41. [41]

    Timebalance: Temporally-invariant and temporally-distinctive video representations for semi-supervised action recognition,

    I. R. Dave, M. N. Rizve, C. Chen, and M. Shah, “Timebalance: Temporally-invariant and temporally-distinctive video representations for semi-supervised action recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2341–2352

  42. [42]

    Would mega- scale datasets further enhance spatiotemporal 3d cnns?

    H. Kataoka, T. Wakamiya, K. Hara, and Y . Satoh, “Would mega- scale datasets further enhance spatiotemporal 3d cnns?”arXiv preprint arXiv:2004.04968, 2020

  43. [43]

    Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics,

    J. Wang, J. Jiao, L. Bao, S. He, Y . Liu, and W. Liu, “Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4006–4015

  44. [44]

    Self-supervised video representation learning by pace prediction,

    J. Wang, J. Jiao, and Y .-H. Liu, “Self-supervised video representation learning by pace prediction,” inEuropean conference on computer vision. Springer, 2020, pp. 504–521

  45. [45]

    Contrastive spatio-temporal pretext learning for self- supervised video representation,

    Y . Zhang, L.-M. Po, X. Xu, M. Liu, Y . Wang, W. Ou, Y . Zhao, and W.-Y . Yu, “Contrastive spatio-temporal pretext learning for self- supervised video representation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 3380–3389

  46. [46]

    Ucf101: A dataset of 101 human actions classes from videos in the wild,

    K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,”arXiv preprint arXiv:1212.0402, 2012

  47. [47]

    Hmdb: a large video database for human motion recognition,

    H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in2011 Interna- tional conference on computer vision. IEEE, 2011, pp. 2556–2563

  48. [48]

    Sefar: Semi- supervised fine-grained action recognition with temporal perturbation and learning stabilization,

    Y . Huang, H. Chen, Z. Xu, Z. Jia, H. Sun, and D. Shao, “Sefar: Semi- supervised fine-grained action recognition with temporal perturbation and learning stabilization,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 4, 2025, pp. 3833–3841

  49. [49]

    Pytorch: An imperative style, high-performance deep learning library,

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antigaet al., “Pytorch: An imperative style, high-performance deep learning library,”Advances in neural information processing systems, vol. 32, 2019

  50. [50]

    Svformer: Semi-supervised video transformer for action recognition,

    Z. Xing, Q. Dai, H. Hu, J. Chen, Z. Wu, and Y .-G. Jiang, “Svformer: Semi-supervised video transformer for action recognition,” inProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 816–18 826