Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

Peisong Wen; Qianqian Xu; Qingming Huang; Siran Dai; Yang Liu

arxiv: 2606.02321 · v1 · pith:G5MBAKPQnew · submitted 2026-06-01 · 💻 cs.CV

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang

Pith reviewed 2026-06-28 15:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords: composed video retrieval · training-free retrieval · video-LLM reasoning · visual candidate selection · DINO visual features · instruction following · multimodal retrieval

The pith

A training-free system retrieves videos matching a reference clip plus a text modification by first selecting visually similar candidates with DINOv3, then checking them with video-LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that composed video retrieval, where a user supplies both an example video and a textual change instruction, can be solved without any task-specific training. It does so by first using frozen visual encoders to narrow the pool to a small set of visually close videos, then prompting large vision-language models to evaluate which of those candidates actually satisfies the stated modification. The approach is evaluated on the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge test set and reports concrete recall numbers. A sympathetic reader would care because most prior composed-retrieval methods require collecting paired training data and running gradient updates; removing that requirement lowers the barrier to deploying flexible video search.

Core claim

The framework obtains a compact candidate list with frozen DINOv3 visual similarity, then applies video-LLMs to score whether each candidate meets the modification instruction, followed by a final reasoning refinement step on the top-ranked items; without any training this pipeline reaches 48.78 Recall@1 and 51.48 Recall@5 on the challenge test set.
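For readers less familiar with the metric, the following is a minimal sketch of how Recall@K is conventionally computed when each query has exactly one target video. The challenge's official scoring script is not shown in the source, so the function name, the input layout, and the toy video IDs here are assumptions for illustration only.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose single target appears in the top-k of its ranking.

    ranked_ids: one ranked list of video IDs per query (best first).
    target_ids: the ground-truth video ID for each query.
    """
    hits = [target in ranked[:k] for ranked, target in zip(ranked_ids, target_ids)]
    return 100.0 * sum(hits) / len(hits)


# Toy example with three queries; the IDs are made up for illustration.
ranked = [["v3", "v1", "v7"], ["v2", "v9", "v4"], ["v5", "v6", "v8"]]
targets = ["v3", "v4", "v0"]

r1 = recall_at_k(ranked, targets, k=1)  # only the first query hits at rank 1
r3 = recall_at_k(ranked, targets, k=3)  # the second query's target is at rank 3
```

Under this definition Recall@5 can never be lower than Recall@1, so the reported gap between 48.78 and 51.48 says that targets rarely sit at ranks 2 to 5: a query's target is usually either ranked first or outside the top five.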

What carries the argument

Visual Representation-Guided Video-LLM Reasoning: a pipeline that first filters candidates by visual similarity computed from frozen DINOv3 features, then uses instruction-following video-LLMs to verify the textual modification, with a final reasoning-based refinement of the top-ranked candidates.
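The filter-then-verify flow described above can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: it assumes each video has already been reduced to a single embedding vector (how DINOv3 frame features are pooled is not stated in the source), it uses cosine similarity as the visual score, and `llm_score` is a hypothetical callable standing in for whatever video-LLM prompt the authors use. The final refinement step is omitted.

```python
import numpy as np


def top_k_candidates(ref_emb, gallery_embs, k):
    """Indices of the k gallery videos most cosine-similar to the reference."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    gal = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = gal @ ref
    return np.argsort(-sims)[:k]


def retrieve(ref_emb, gallery_embs, instruction, llm_score, k=10):
    """Stage 1: visual shortlist. Stage 2: re-rank the shortlist by LLM score."""
    cand = top_k_candidates(ref_emb, gallery_embs, k)
    # llm_score is a hypothetical stand-in: higher means the candidate
    # better satisfies the modification instruction.
    scores = np.array([llm_score(int(i), instruction) for i in cand])
    return cand[np.argsort(-scores, kind="stable")]


# Toy data: four 2-D "video embeddings" and a stubbed scorer.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
reference = np.array([1.0, 0.0])
stub_scores = {0: 0.1, 1: 0.2, 3: 0.9}
ranked = retrieve(
    reference,
    gallery,
    "make it night-time",
    lambda idx, text: stub_scores.get(idx, 0.0),
    k=3,
)
```

In the toy run, video 2 never reaches the scorer because it falls outside the visual shortlist, which is exactly the behaviour the load-bearing premise depends on.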

If this is right

Retrieval systems for composed video queries can be built and deployed using only off-the-shelf frozen models.
Performance scales with the quality of the underlying video-LLM without requiring new training runs.
The same candidate-filter-then-reason pattern can be applied to other multimodal retrieval settings that combine an example and a modification instruction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the visual encoder and language model disagree on many cases, adding a lightweight calibration step between the two stages could raise recall without introducing training.
The method implicitly assumes the modification instruction is short and explicit; longer or ambiguous instructions may require the LLM stage to be prompted differently.
Because no training occurs, the same pipeline can be tested on new domains simply by swapping the underlying video-LLM.

Load-bearing premise

That visual similarity alone is sufficient to surface a small set of candidates that includes the correct video, and that the video-LLM can then reliably judge which one satisfies the modification text.
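One way to test the first half of this premise is to measure candidate-stage recall, the fraction of queries whose ground-truth target survives the visual filter. The sketch below is an assumption about how such a check could be written; the function name and the toy shortlists are hypothetical, and the source reports no such figure.

```python
def candidate_stage_recall(candidate_sets, target_ids):
    """Fraction of queries whose ground-truth target is inside the shortlist."""
    kept = sum(target in set(cands) for cands, target in zip(candidate_sets, target_ids))
    return kept / len(target_ids)


# Toy example: four queries, each with a three-video shortlist.
shortlists = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 5, 9]]
targets = [2, 0, 9, 9]
stage1_recall = candidate_stage_recall(shortlists, targets)  # second query loses its target
```

If the later stages only reorder the shortlist, which is an assumption about the pipeline, this number is an upper bound on the final Recall@K for any K, so it indicates how much of the remaining error the video-LLM stage could possibly fix.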

What would settle it

On the same test set, replace the DINOv3 candidate stage with random selection of the same number of videos and measure whether Recall@1 drops below 10 percent.
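To make the proposed test concrete: if k candidates are drawn uniformly at random from a gallery of n videos, the target lands in the shortlist with probability k/n, so no downstream reasoning stage can push Recall@1 above that ceiling. The sketch below computes the ceiling and checks it by simulation. The values of k and n are hypothetical, since the source gives neither the shortlist size nor the gallery size.

```python
import random


def random_shortlist_ceiling(k, n):
    """Probability that a uniformly random k-subset of n videos contains the target."""
    return k / n


def simulate_ceiling(k, n, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        target = rng.randrange(n)
        shortlist = rng.sample(range(n), k)
        hits += target in shortlist
    return hits / trials


# Hypothetical sizes: a 50-video shortlist from a 1000-video gallery.
ceiling = random_shortlist_ceiling(50, 1000)
estimate = simulate_ceiling(50, 1000)
```

Whether 10 percent is the right threshold therefore depends on the actual ratio k/n: the test is only informative when the shortlist is a small fraction of the gallery.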

Original abstract

Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is further performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and detailed integration between visual representations and language reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper wires DINOv3 candidate filtering to video-LLM reasoning for a training-free entry in the CVPR 2026 composed video retrieval challenge and reports 48.78 R@1, but supplies no evidence that the visual filter retains the targets.

The main takeaway is a practical, training-free pipeline for the Reason-Aware Composed Video Retrieval Challenge. It pulls a shortlist of candidates with frozen DINOv3 visual similarity to the reference video, then runs video-LLM reasoning to check which ones satisfy the modification instruction, followed by a refinement pass on the top results.

What the work does is demonstrate that existing models can be combined to reach 48.78 Recall@1 and 51.48 Recall@5 on the test set without any fine-tuning. That is a concrete data point for challenge participants looking for a quick starting system.

The specific three-stage setup is new as an application to this exact challenge, even if it rests on standard components. The abstract is clear about the flow and the numbers.

The soft spot is the untested assumption that DINOv3 similarity will keep the ground-truth target inside the candidate pool. When the modification instruction calls for large visual changes, the target can easily rank low on reference similarity and get filtered out before the LLM stage ever runs. The abstract gives no candidate-stage recall figures or breakdown by modification type, so the reported scores cannot be read as strong evidence for the method.

No baselines, implementation details, or error analysis appear. The paper is a system description rather than a controlled evaluation.

This is mainly useful to teams entering the same 2026 challenge who want an off-the-shelf idea. Readers seeking general methods or rigorous validation will find little here.

I would not send it for peer review. The central claim depends on an assumption that the abstract leaves unexamined.

Referee Report

2 major / 0 minor

Summary. The paper proposes a training-free framework for composed video retrieval that first applies frozen DINOv3 models to retrieve a compact set of visually similar candidate videos given a reference video, then uses video-LLMs to check which candidates satisfy a textual modification instruction, followed by a reasoning-based refinement step on top candidates. It reports achieving 48.78 Recall@1 and 51.48 Recall@5 on the test set of the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge without any training.

Significance. If the reported performance is reproducible and the method generalizes, the work demonstrates that combining frozen visual encoders with off-the-shelf video-LLMs can yield non-trivial results on composed retrieval without task-specific fine-tuning, which would be useful for low-resource or rapid-deployment scenarios. However, the absence of baselines, component ablations, or verification of the filtering stage limits assessment of whether this represents a meaningful advance over existing approaches.

major comments (2)

[Abstract] The central performance claims (48.78 R@1, 51.48 R@5) are stated without any baseline comparisons, implementation details (e.g., candidate pool size, LLM prompting strategy, or exact DINOv3 variant), error analysis, or verification that the ground-truth target survives the initial DINOv3 filtering stage for a sufficient fraction of queries. This makes the numbers impossible to interpret or reproduce and directly undermines evaluation of the framework's effectiveness.
[Abstract] Framework description: The method assumes that DINOv3 visual similarity to the reference video will place the target video within the compact candidate set even when the modification instruction induces substantial visual changes (different objects, scenes, or motion). No candidate-stage recall statistics or failure-case analysis are provided to support this load-bearing assumption; if the target is frequently filtered out, the subsequent LLM reasoning stage cannot contribute to the reported scores.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve reproducibility and address the concerns raised.

Referee: [Abstract] The central performance claims (48.78 R@1, 51.48 R@5) are stated without any baseline comparisons, implementation details (e.g., candidate pool size, LLM prompting strategy, or exact DINOv3 variant), error analysis, or verification that the ground-truth target survives the initial DINOv3 filtering stage for a sufficient fraction of queries. This makes the numbers impossible to interpret or reproduce and directly undermines evaluation of the framework's effectiveness.

Authors: We agree that the abstract as written lacks sufficient implementation details and supporting analysis for full interpretability. As this is a new challenge task, direct baselines from prior work are limited, but we will add comparisons to simple retrieval baselines in the revision. We will expand the abstract and main text with the requested details (DINOv3 variant, candidate pool size, prompting strategy) and include an error analysis plus verification of filtering-stage recall. These additions will be made in the revised manuscript.

revision: yes

Referee: [Abstract] Framework description: The method assumes that DINOv3 visual similarity to the reference video will place the target video within the compact candidate set even when the modification instruction induces substantial visual changes (different objects, scenes, or motion). No candidate-stage recall statistics or failure-case analysis are provided to support this load-bearing assumption; if the target is frequently filtered out, the subsequent LLM reasoning stage cannot contribute to the reported scores.

Authors: The assumption is indeed load-bearing for the pipeline. We will add candidate-stage recall statistics (fraction of queries where the ground-truth target is retained after DINOv3 filtering) and a dedicated failure-case analysis section in the revised manuscript to quantify and discuss this aspect.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system description with external performance metrics

The paper presents a training-free composed video retrieval framework that combines frozen DINOv3 for candidate retrieval with video-LLM reasoning and refinement. No mathematical derivations, equations, fitted parameters, or self-citations appear in the provided text. The reported Recall@1 and Recall@5 values are measured on an external test set and do not reduce to any internal construction or ansatz. The method is a straightforward pipeline description whose validity rests on empirical results rather than any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about pre-trained models and supplies no free parameters or invented entities.

axioms (2)

domain assumption: Frozen DINOv3 embeddings produce a compact set of visually relevant candidates for a reference video.
  Invoked in the first stage of the framework described in the abstract.
domain assumption: Video-LLMs can accurately judge whether a candidate satisfies a textual modification instruction.
  Invoked in the second and third stages of the framework described in the abstract.

