pith. machine review for the scientific record.

arxiv: 2605.07064 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

Learning to Track Instance from Single Nature Language Description

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords: self-supervised VL tracking · dynamic token aggregation · vision-language tracking · object tracking · natural language description · no bounding box annotations

The pith

A Dynamic Token Aggregation Module enables self-supervised tracking of any object referred to by natural language in video without bounding box labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SVLTrack, a self-supervised vision-language tracker that follows objects described in natural language using only unlabeled videos. Its core component is a Dynamic Token Aggregation Module that selects important visual tokens from the template frame via an anchor token and attention scores, merges them into the language tokens to cut noise and strengthen vision-language alignment, and then uses the fused tokens to locate the target in search frames while propagating guidance forward in time. This setup lets the model learn instance tracking representations autonomously. Experiments on VL tracking benchmarks show it exceeds prior self-supervised methods.

Core claim

SVLTrack demonstrates that self-supervised VL tracking can be achieved by treating visual tokens unequally: selecting multiple important target tokens from the template frame using an anchor token, merging them according to their attention scores and aggregating them into the language tokens to eliminate redundant visual-token noise and enhance semantic alignment, and then using the fused language tokens as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames.

What carries the argument

Dynamic Token Aggregation Module, which selects key visual tokens from the template using an anchor and attention scores, merges them into language tokens, and uses the result to guide target extraction and temporal propagation in search frames.
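To make those three steps concrete, below is a minimal PyTorch sketch of the module as described on this page. Everything beyond the three steps is an assumption: the function and argument names are hypothetical, the anchor is taken as the mean language token, attention is single-head dot-product, and the selection size k is arbitrary; the paper's actual layer structure is not given in the material above.

    import torch

    def dynamic_token_aggregation(lang_tokens, template_tokens, search_tokens, k=16):
        # Sketch of the three-step module. lang_tokens: (L, d) description
        # tokens; template_tokens: (Nt, d) and search_tokens: (Ns, d) visual
        # tokens from the template and search frames.
        d = lang_tokens.shape[-1]

        # i) Score every template token against an anchor derived from the
        # language side and keep the k most target-like tokens.
        anchor = lang_tokens.mean(dim=0, keepdim=True)                 # (1, d)
        attn = ((anchor @ template_tokens.T) / d**0.5).softmax(-1).squeeze(0)
        top = attn.topk(k)                                             # needs k <= Nt

        # ii) Merge the selected tokens, weighted by their attention scores,
        # and aggregate the result into the language tokens.
        weights = (top.values / top.values.sum()).unsqueeze(-1)       # (k, 1)
        merged = (weights * template_tokens[top.indices]).sum(dim=0)   # (d,)
        fused_lang = lang_tokens + merged                              # (L, d)

        # iii) Use the fused language tokens as guiding signals to score
        # potential target tokens in the search frame; downstream these act
        # as temporal prompts for later frames.
        cross = (fused_lang @ search_tokens.T) / d**0.5                # (L, Ns)
        target_scores = cross.softmax(dim=-1).mean(dim=0)              # (Ns,)
        return fused_lang, target_scores

The weighted merge in step ii) is what implements the "unequal treatment" of visual tokens: low-attention tokens contribute almost nothing to the fused representation, which is the claimed noise-suppression mechanism.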

If this is right

  • Trackers can be trained on large-scale unlabeled video data guided only by natural language descriptions.
  • Semantic alignment between vision and language improves by discarding redundant visual token noise through selective aggregation.
  • Temporal consistency across frames strengthens when fused language tokens serve as prompts for subsequent search frames (see the rollout sketch after this list).
  • Self-supervised VL tracking can outperform methods that fuse all language and visual tokens equally.
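As a companion to the module sketch above, a hedged reading of that temporal loop: each frame's fused language tokens become the prompt for the next frame. This rollout is one plausible interpretation of "propagate them to subsequent frames"; the decoding of token scores into a box and all training losses are elided.

    def track_sequence(lang_tokens, template_tokens, frame_token_stream, k=16):
        # Hypothetical rollout reusing dynamic_token_aggregation from the
        # sketch above; frame_token_stream yields (Ns, d) visual tokens for
        # each successive search frame.
        prompt = lang_tokens
        peak_tokens = []
        for search_tokens in frame_token_stream:
            prompt, target_scores = dynamic_token_aggregation(
                prompt, template_tokens, search_tokens, k=k)
            # A real tracker head would decode target_scores into a box or
            # mask; here we only record the highest-scoring token index.
            peak_tokens.append(int(target_scores.argmax()))
        return peak_tokens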

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The selective token approach may extend to other video-language tasks such as action localization or video question answering to reduce reliance on dense annotations.
  • Training at internet-video scale becomes feasible if attention-based selection proves robust across diverse domains without box labels.
  • The mechanism could yield more interpretable trackers by revealing which visual tokens language descriptions activate at each step.

Load-bearing premise

The Dynamic Token Aggregation Module can effectively select and aggregate tokens to enhance semantic alignment and temporal prompts without any supervision from bounding boxes, relying only on the attention scores and language guidance.

What would settle it

An experiment on VL tracking benchmarks in which SVLTrack does not surpass existing self-supervised methods, or a qualitative case where the tokens selected and aggregated by the module fail to match the object referred to in the language description.

Figures

Figures reproduced from arXiv: 2605.07064 by Bineng Zhong, Haiying Xia, Qihua Liang, Shuimu Zeng, Shuxiang Song, Yaozong Zheng.

Figure 2: The annotation requirements for different tracking tasks.
Figure 3: Overview of the proposed SVLTrack framework.
Figure 4: Our Dynamic Token Aggregation Module. It treats each visual token unequally, following three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens. iii) The fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames.
Figure 5: Visualization of the instance tokens sampled from the …
Figure 6: Comparison of different denoising ratios.
read the original abstract

How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce SVLTrack, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that SVLTrack surpasses SOTA self-supervised methods.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces SVLTrack, a self-supervised vision-language tracker capable of tracking any object referred to by a natural language description in a video sequence without any bounding-box annotations. It proposes a Dynamic Token Aggregation Module with three steps: anchor-based selection of target tokens from the template frame, merging of the selected tokens into the language tokens via attention scores to enhance semantic alignment, and use of the fused tokens as guidance to extract and propagate target tokens across search frames as temporal prompts. The approach enables self-supervised learning of language-guided tracking representations, with claims of surpassing state-of-the-art self-supervised methods on VL tracking benchmarks.

Significance. If the self-supervised mechanism proves effective, the work would offer a meaningful reduction in annotation costs for vision-language tracking by demonstrating that language descriptions alone can bootstrap instance tracking representations from unlabeled video data.

major comments (1)
  1. [Abstract and Dynamic Token Aggregation Module] The three-step process (anchor-based selection, attention-score-based merging into the language tokens, then language-guided extraction from search frames) is presented as producing useful tracking representations from unlabeled video plus language alone, yet no initialization, auxiliary loss, or validation mechanism is described that would ensure early attention maps reliably highlight the referred instance rather than distractors or background. This assumption is load-bearing for the self-supervised loop to function as claimed.
minor comments (3)
  1. [Title] The title contains 'Nature Language' which should read 'Natural Language'.
  2. [Abstract] The abstract's claim that SVLTrack 'surpasses SOTA self-supervised methods' would be more informative if it referenced specific benchmarks, metrics, or a results table.
  3. [Abstract] LaTeX artifacts such as '{' and '}' around tracker references should be cleaned for readability in the final version.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address the major comment on the self-supervised mechanism below and commit to improving the clarity of the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Dynamic Token Aggregation Module] The three-step process (anchor-based selection, attention-score-based merging into the language tokens, then language-guided extraction from search frames) is presented as producing useful tracking representations from unlabeled video plus language alone, yet no initialization, auxiliary loss, or validation mechanism is described that would ensure early attention maps reliably highlight the referred instance rather than distractors or background. This assumption is load-bearing for the self-supervised loop to function as claimed.

    Authors: We appreciate the referee highlighting this foundational assumption. In our approach, the language description directly initializes the process: cross-attention is computed between the language tokens and all visual tokens in the template frame to identify and select the anchor token corresponding to the referred instance. This language-guided selection provides the starting bias for the attention maps without requiring bounding-box annotations, auxiliary losses, or separate validation steps. The subsequent merging and propagation steps then reinforce this alignment through temporal consistency across unlabeled frames. We agree the current description in the abstract and method section is insufficiently explicit on this initialization. We will revise the manuscript to elaborate on the language-to-visual attention for anchor selection, add a dedicated paragraph on the bootstrapping mechanism, and include attention map visualizations from early training stages to demonstrate reliable focus on the target instance. revision: partial
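A minimal sketch of the bootstrapping step the response describes, under the same assumptions as the earlier sketches; only the language-to-template cross-attention is stated in the rebuttal, so the function name and the argmax readout are hypothetical.

    import torch

    def select_anchor(lang_tokens, template_tokens):
        # Cross-attention scores between every language token and every
        # template visual token, as the rebuttal describes.
        d = lang_tokens.shape[-1]
        attn = (lang_tokens @ template_tokens.T) / d**0.5    # (L, Nt)
        # Pool over language tokens and take the most-attended visual token
        # as the anchor for the referred instance (assumed readout).
        saliency = attn.softmax(dim=-1).mean(dim=0)          # (Nt,)
        return template_tokens[saliency.argmax()]            # (d,)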

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained

full rationale

The paper introduces SVLTrack via a novel Dynamic Token Aggregation Module whose three steps (anchor-based token selection, attention-score merging into language tokens, and language-guided extraction) are presented as an independent architectural choice enabling self-supervised learning from unlabeled video plus language descriptions. No equations, loss definitions, or claimed predictions in the abstract or described method reduce by construction to fitted inputs, self-citations, or renamed prior results; the bootstrap relies on the proposed attention mechanism itself rather than presupposing the target output. The central claim of surpassing SOTA self-supervised methods is positioned as empirically validated rather than definitionally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard transformer attention assumptions but introduces no new free parameters or entities explicitly in the abstract.

axioms (1)
  • domain assumption: The attention mechanism in transformers can be used to select important visual tokens based on language guidance.
    Implicit in the token aggregation step described in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1337 out tokens · 33924 ms · 2026-05-11T01:40:41.137346+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages

  1. [1]

    Transformer tracking

    Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In CVPR, pages 8126–8135, 2021.

  2. [2]

    Mixformer: End-to-end tracking with iterative mixed attention

    Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In CVPR, pages 13598–13608, 2022.

  3. [3]

    Unbiased missing-modality multimodal learning

    Ruiting Dai, Chenxi Li, Yandong Yan, Lisi Mo, Ke Qin, and Tao He. Unbiased missing-modality multimodal learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24507–24517, 2025.

  4. [4]

    ECO: efficient convolution operators for tracking

    Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: efficient convolution operators for tracking. In CVPR, pages 6931–6939, 2017.

  5. [5]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.

  6. [6]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  7. [7]

    Lasot: A high-quality benchmark for large-scale single object tracking

    Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In CVPR, pages 5374–5383, 2019.

  8. [8]

    Lasot: A high-quality large-scale single object tracking benchmark

    Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Lin Yuan, and Haibin Ling. Lasot: A high-quality large-scale single object tracking benchmark. Int. J. Comput. Vis., pages 439–461, 2021.

  9. [9]

    Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers

    Qi Feng, Vitaly Ablavsky, Qinxun Bai, and Stan Sclaroff. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In CVPR, pages 5851–5860, 2021.

  10. [10]

    Memvlt: Vision-language tracking with adaptive memory-based prompts

    Xiaokun Feng, Xuchen Li, Shiyu Hu, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, and Kaiqi Huang. Memvlt: Vision-language tracking with adaptive memory-based prompts. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  11. [11]

    Rethinking obscured sub-optimality in analytic learning for exemplar-free class-incremental learning

    Zijian Gao, Kele Xu, Xingxing Zhang, Huiping Zhuang, Tianjiao Wan, Bo Ding, Xinjun Mao, and Wang Huaimin. Rethinking obscured sub-optimality in analytic learning for exemplar-free class-incremental learning. IEEE Transactions on Circuits and Systems for Video Technology, 36(10):1123–1136, 2025.

  12. [12]

    Consistencies are all you need for semi-supervised vision-language tracking

    Jiawei Ge, Jiuxin Cao, Xuelin Zhu, Xinyu Zhang, Chang Liu, Kun Wang, and Bo Liu. Consistencies are all you need for semi-supervised vision-language tracking. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 1895–1904, 2024.

  13. [13]

    A survey of wireless sensing security from a role-based view

    Ruixu Geng, Jianyang Wang, Yuqin Yuan, Fengquan Zhan, Tianyu Zhang, Rui Zhang, Pengcheng Huang, Dongheng Zhang, Jinbo Chen, Yang Hu, et al. A survey of wireless sensing security from a role-based view. IEEE Communications Surveys & Tutorials, 2025.

  14. [14]

    Divert more attention to vision-language tracking

    Mingzhe Guo, Zhipeng Zhang, Heng Fan, and Liping Jing. Divert more attention to vision-language tracking. NIPS, abs/2207.01076, 2022.

  15. [15]

    Segment concealed object with incomplete supervision

    Chunming He, Kai Li, Yachao Zhang, Ziyun Yang, Longxiang Tang, Yulun Zhang, Linghe Kong, and Sina Farsiu. Segment concealed object with incomplete supervision. TPAMI.

  16. [16]

    Reversible unfolding network for concealed visual perception with generative refinement

    Chunming He, Fengyang Xiao, Rihan Zhang, Chengyu Fang, Deng-Ping Fan, and Sina Farsiu. Reversible unfolding network for concealed visual perception with generative refinement. arXiv preprint arXiv:2508.15027, 2025.

  17. [17]

    Masked video pretraining advances real-world video denoising

    Yi Jin, Xiaoxiao Ma, Rui Zhang, Huaian Chen, Yuxuan Gu, Pengyang Ling, and Enhong Chen. Masked video pretraining advances real-world video denoising. IEEE Transactions on Multimedia, 27:622–636, 2025.

  18. [18]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

  19. [19]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.

  20. [20]

    Self-supervised tracking via target-aware data synthesis

    Xin Li, Wenjie Pei, Yaowei Wang, Zhenyu He, Huchuan Lu, and Ming-Hsuan Yang. Self-supervised tracking via target-aware data synthesis. IEEE Transactions on Neural Networks and Learning Systems, 2023.

  21. [21]

    Dtllm-vlt: Diverse text generation for visual language tracking based on llm

    Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, and Kaiqi Huang. Dtllm-vlt: Diverse text generation for visual language tracking based on llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7283–7292, 2024.

  22. [22]

    Dynamic updates for language adaptation in visual-language tracking

    Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, and Shuxiang Song. Dynamic updates for language adaptation in visual-language tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19165–19174, 2025.

  23. [23]

    Cross-modal augmentation for low-resource language understanding and generation

    Zichao Li and Zong Ke. Cross-modal augmentation for low-resource language understanding and generation. In Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025), pages 90–99, 2025.

  24. [24]

    Tracking by natural language specification

    Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Tracking by natural language specification. In CVPR, pages 7350–7358, 2017.

  25. [25]

    Progressive semantic-visual alignment and refinement for vision-language tracking

    Yanjie Liang, Qiangqiang Wu, Lin Cheng, Changqun Xia, and Jia Li. Progressive semantic-visual alignment and refinement for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 2024.

  26. [26]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.

  27. [27]

    Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation

    Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, and Feiyue Huang. Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6489–6498.

  28. [28]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

  29. [29]

    Tf-icon: Diffusion-based training-free cross-domain image composition

    Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. Tf-icon: Diffusion-based training-free cross-domain image composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2294–2305, 2023.

  30. [30]

    Mace: Mass concept erasure in diffusion models

    Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. Mace: Mass concept erasure in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6430–6440, 2024.

  31. [31]

    Unifying visual and vision-language tracking via contrastive learning

    Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, and Mengxue Kang. Unifying visual and vision-language tracking via contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4107–4116, 2024.

  32. [32]

    Towards realistic data generation for real-world super-resolution

    Long Peng, Wenbo Li, Renjing Pei, Jingjing Ren, Jiaqi Xu, Yang Wang, Yang Cao, and Zheng-Jun Zha. Towards realistic data generation for real-world super-resolution. In The Thirteenth International Conference on Learning Representations.

  33. [33]

    Lightweight adaptive feature de-drifting for compressed image classification

    Long Peng, Yang Cao, Yuejin Sun, and Yang Wang. Lightweight adaptive feature de-drifting for compressed image classification. IEEE Transactions on Multimedia, 26:6424–6436, 2024.

  34. [34]

    Convex optimization of markov decision processes based on z transform: A theoretical framework for two-space decomposition and linear programming reconstruction

    Shiqing Qiu, Haoyu Wang, Yuxin Zhang, Zong Ke, and Zichao Li. Convex optimization of markov decision processes based on z transform: A theoretical framework for two-space decomposition and linear programming reconstruction. Mathematics, 13(11):1765, 2025.

  35. [35]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian D. Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019.

  36. [36]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  37. [37]

    Context-aware integration of language and visual references for natural language tracking

    Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, and Jiming Chen. Context-aware integration of language and visual references for natural language tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19208–19217, 2024.

  38. [38]

    Unsupervised learning of accurate siamese tracking

    Qiuhong Shen, Lei Qiao, Jinyang Guo, Peixia Li, Xin Li, Bo Li, Weitao Feng, Weihao Gan, Wei Wu, and Wanli Ouyang. Unsupervised learning of accurate siamese tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8101–8110, 2022.

  39. [39]

    Aligning and prompting everything all at once for universal visual perception

    Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, and Rongrong Ji. Aligning and prompting everything all at once for universal visual perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13193–13203, 2024.

  40. [40]

    Transformer tracking with cyclic shifting window attention

    Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8791–8800.

  41. [41]

    Compact transformer tracker with correlative masked modeling

    Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. In Proceedings of the AAAI conference on artificial intelligence, pages 2321–2329, 2023.

  42. [42]

    Chattracker: Enhancing visual tracking performance via chatting with multimodal large language model

    Yiming Sun, Fan Yu, Shaoxiang Chen, Yu Zhang, Junwei Huang, Yang Li, Chenhui Li, and Changbo Wang. Chattracker: Enhancing visual tracking performance via chatting with multimodal large language model. Advances in Neural Information Processing Systems, 37:39303–39324, 2024.

  43. [43]

    Unsupervised deep tracking

    Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, and Houqiang Li. Unsupervised deep tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1308–1317, 2019.

  44. [44]

    Unsupervised deep representation learning for real-time tracking

    Ning Wang, Wengang Zhou, Yibing Song, Chao Ma, Wei Liu, and Houqiang Li. Unsupervised deep representation learning for real-time tracking. International Journal of Computer Vision, 129(2):400–418, 2021.

  45. [45]

    Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark

    Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In CVPR, pages 13763–13773, 2021.

  46. [46]

    Dropmae: Masked autoencoders with spatial-attention dropout for tracking tasks

    Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, and Antoni B. Chan. Dropmae: Masked autoencoders with spatial-attention dropout for tracking tasks. CVPR, abs/2304.00571, 2023.

  47. [47]

    Autoregressive visual tracking

    Wei Xing, Bai Yifan, Zheng Yongchao, Shi Dahu, and Gong Yihong. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9697–9706, 2023.

  48. [48]

    Learning spatio-temporal transformer for visual tracking

    Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In ICCV, pages 10428–10437, 2021.

  49. [49]

    Visa: Reasoning video object segmentation via large language models

    Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. In European Conference on Computer Vision, pages 98–115. Springer, 2025.

  50. [50]

    Joint feature learning and relation modeling for tracking: A one-stream framework

    Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In ECCV (22), pages 341–357, 2022.

  51. [51]

    Knowledge-aligned counterfactual-enhancement diffusion perception for unsupervised cross-domain visual emotion recognition

    Wen Yin, Yong Wang, Guiduo Duan, Dongyang Zhang, Xin Hu, Yuan-Fang Li, and Tao He. Knowledge-aligned counterfactual-enhancement diffusion perception for unsupervised cross-domain visual emotion recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3888–3898, 2025.

  52. [52]

    Self-supervised deep correlation tracking

    Di Yuan, Xiaojun Chang, Po-Yao Huang, Qiao Liu, and Zhenyu He. Self-supervised deep correlation tracking. IEEE Transactions on Image Processing, 30:976–985, 2020.

  53. [53]

    X2-vlm: All-in-one pre-trained model for vision-language tasks

    Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and Wangchunshu Zhou. X2-vlm: All-in-one pre-trained model for vision-language tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

  54. [54]

    All in one: Exploring unified vision-language tracking with multi-modal alignment

    Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, and Yanfeng Wang. All in one: Exploring unified vision-language tracking with multi-modal alignment. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5552–5561, 2023.

  55. [55]

    Rf-mamba: Frequency-aware state space model for rf-based human-centric perception

    Rui Zhang, Ruixu Geng, Yadong Li, Ruiyuan Song, Hanqin Gong, Dongheng Zhang, Yang Hu, and Yan Chen. Rf-mamba: Frequency-aware state space model for rf-based human-centric perception. In The Thirteenth International Conference on Learning Representations, 2025.

  56. [56]

    Diff-tracker: text-to-image diffusion models are unsupervised trackers

    Zhengbo Zhang, Li Xu, Duo Peng, Hossein Rahmani, and Jun Liu. Diff-tracker: text-to-image diffusion models are unsupervised trackers. In European Conference on Computer Vision, pages 319–337. Springer, 2025.

  57. [57]

    Learning to track objects from unlabeled videos

    Jilai Zheng, Chao Ma, Houwen Peng, and Xiaokang Yang. Learning to track objects from unlabeled videos. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13546–13555, 2021.

  58. [58]

    Leveraging local and global cues for visual tracking via parallel interaction network

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhenjun Tang, Rongrong Ji, and Xianxian Li. Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Transactions on Circuits and Systems for Video Technology, 33(4):1671–1683, 2022.

  59. [59]

    Toward unified token learning for vision-language tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Toward unified token learning for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023.

  60. [60]

    Odtrack: Online dense temporal token learning for visual tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. Odtrack: Online dense temporal token learning for visual tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7588–7596, 2024.

  61. [61]

    Decoupled spatio-temporal consistency learning for self-supervised tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, and Shuxiang Song. Decoupled spatio-temporal consistency learning for self-supervised tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10635–10643, 2025.

  62. [62]

    Towards universal modal tracking with online dense temporal token learning

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, and Rongrong Ji. Towards universal modal tracking with online dense temporal token learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  63. [63]

    Comem: Compositional concept-graph memory for vision–language adaptation

    Heng Zhou, Jing Tang, Yanshu Li, Canran Xiao, Liwei Hou, Zong Ke, Jiawei Yao, et al. Comem: Compositional concept-graph memory for vision–language adaptation. In The Fourteenth International Conference on Learning Representations, 2026.

  64. [64]

    Joint visual grounding and tracking with natural language specification

    Li Zhou, Zikun Zhou, Kaige Mao, and Zhenyu He. Joint visual grounding and tracking with natural language specification. CVPR, abs/2303.12027, 2023.

  65. [65]

    Tracking with human-intent reasoning

    Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning. arXiv preprint arXiv:2312.17448, 2023.
