pith. machine review for the scientific record.

arxiv: 2605.07064 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

Learning to Track Instance from Single Nature Language Description

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords: self-supervised VL tracking · dynamic token aggregation · vision-language tracking · object tracking · natural language description · no bounding box annotations

The pith

A Dynamic Token Aggregation Module enables self-supervised tracking of any object referred to by natural language in video without bounding box labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SVLTrack, a self-supervised vision-language tracker that follows objects described in natural language using only unlabeled videos. Its core component is a Dynamic Token Aggregation Module that selects important visual tokens from the template frame via an anchor token and attention scores, merges them into the language tokens to cut noise and strengthen vision-language alignment, and then uses the fused tokens to locate the target in search frames while propagating guidance forward in time. This setup lets the model learn instance tracking representations autonomously. Experiments on VL tracking benchmarks show it exceeds prior self-supervised methods.

Core claim

SVLTrack demonstrates that self-supervised VL tracking can be achieved by treating visual tokens unequally: selecting multiple important target tokens from the template frame using an anchor token, merging them according to their attention scores and aggregating them into the language tokens to eliminate redundant visual-token noise and enhance semantic alignment, and then using the fused language tokens as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames.

What carries the argument

Dynamic Token Aggregation Module, which selects key visual tokens from the template using an anchor and attention scores, merges them into language tokens, and uses the result to guide target extraction and temporal propagation in search frames.
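To make those three steps concrete, below is a minimal PyTorch sketch of the module as described on this page. Everything beyond the three steps is an assumption: the function and argument names are hypothetical, the anchor is taken as the mean language token, attention is single-head dot-product, and the selection size k is arbitrary; the paper's actual layer structure is not given in the material above.

    import torch

    def dynamic_token_aggregation(lang_tokens, template_tokens, search_tokens, k=16):
        # Sketch of the three-step module. lang_tokens: (L, d) description
        # tokens; template_tokens: (Nt, d) and search_tokens: (Ns, d) visual
        # tokens from the template and search frames.
        d = lang_tokens.shape[-1]

        # i) Score every template token against an anchor derived from the
        # language side and keep the k most target-like tokens.
        anchor = lang_tokens.mean(dim=0, keepdim=True)                 # (1, d)
        attn = ((anchor @ template_tokens.T) / d**0.5).softmax(-1).squeeze(0)
        top = attn.topk(k)                                             # needs k <= Nt

        # ii) Merge the selected tokens, weighted by their attention scores,
        # and aggregate the result into the language tokens.
        weights = (top.values / top.values.sum()).unsqueeze(-1)       # (k, 1)
        merged = (weights * template_tokens[top.indices]).sum(dim=0)   # (d,)
        fused_lang = lang_tokens + merged                              # (L, d)

        # iii) Use the fused language tokens as guiding signals to score
        # potential target tokens in the search frame; downstream these act
        # as temporal prompts for later frames.
        cross = (fused_lang @ search_tokens.T) / d**0.5                # (L, Ns)
        target_scores = cross.softmax(dim=-1).mean(dim=0)              # (Ns,)
        return fused_lang, target_scores

The weighted merge in step ii) is what implements the "unequal treatment" of visual tokens: low-attention tokens contribute almost nothing to the fused representation, which is the claimed noise-suppression mechanism.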

If this is right

  • Trackers can be trained on large-scale unlabeled video data guided only by natural language descriptions.
  • Semantic alignment between vision and language improves by discarding redundant visual token noise through selective aggregation.
  • Temporal consistency across frames strengthens when fused language tokens serve as prompts for subsequent search frames (see the rollout sketch after this list).
  • Self-supervised VL tracking can outperform methods that fuse all language and visual tokens equally.
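As a companion to the module sketch above, a hedged reading of that temporal loop: each frame's fused language tokens become the prompt for the next frame. This rollout is one plausible interpretation of "propagate them to subsequent frames"; the decoding of token scores into a box and all training losses are elided.

    def track_sequence(lang_tokens, template_tokens, frame_token_stream, k=16):
        # Hypothetical rollout reusing dynamic_token_aggregation from the
        # sketch above; frame_token_stream yields (Ns, d) visual tokens for
        # each successive search frame.
        prompt = lang_tokens
        peak_tokens = []
        for search_tokens in frame_token_stream:
            prompt, target_scores = dynamic_token_aggregation(
                prompt, template_tokens, search_tokens, k=k)
            # A real tracker head would decode target_scores into a box or
            # mask; here we only record the highest-scoring token index.
            peak_tokens.append(int(target_scores.argmax()))
        return peak_tokens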

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The selective token approach may extend to other video-language tasks such as action localization or video question answering to reduce reliance on dense annotations.
  • Training at internet-video scale becomes feasible if attention-based selection proves robust across diverse domains without box labels.
  • The mechanism could yield more interpretable trackers by revealing which visual tokens language descriptions activate at each step.

Load-bearing premise

The Dynamic Token Aggregation Module can effectively select and aggregate tokens to enhance semantic alignment and temporal prompts without any supervision from bounding boxes, relying only on the attention scores and language guidance.

What would settle it

An experiment on VL tracking benchmarks in which SVLTrack does not surpass existing self-supervised methods, or a qualitative case where the tokens selected and aggregated by the module fail to match the object referred to in the language description.

Figures

Figures reproduced from arXiv: 2605.07064 by Bineng Zhong, Haiying Xia, Qihua Liang, Shuimu Zeng, Shuxiang Song, Yaozong Zheng.

Figure 2: The annotation requirements for different tracking tasks.
Figure 3: Overview of the proposed SVLTrack framework.
Figure 4: Our Dynamic Token Aggregation Module. It treats each visual token unequally, following three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens. iii) The fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames.
Figure 5: Visualization of the instance tokens sampled from the …
Figure 6: Comparison of different denoising ratios.
read the original abstract

How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce SVLTrack, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that SVLTrack surpasses SOTA self-supervised methods.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces SVLTrack, a self-supervised vision-language tracker capable of tracking any object referred to by a natural language description in a video sequence without any bounding-box annotations. It proposes a Dynamic Token Aggregation Module with three steps: anchor-based selection of target tokens from the template frame, merging of the selected tokens into the language tokens via attention scores to enhance semantic alignment, and use of the fused tokens as guidance to extract and propagate target tokens across search frames as temporal prompts. The approach enables self-supervised learning of language-guided tracking representations, with claims of surpassing state-of-the-art self-supervised methods on VL tracking benchmarks.

Significance. If the self-supervised mechanism proves effective, the work would offer a meaningful reduction in annotation costs for vision-language tracking by demonstrating that language descriptions alone can bootstrap instance tracking representations from unlabeled video data.

major comments (1)
  1. [Abstract and Dynamic Token Aggregation Module] The three-step process (anchor-based selection, attention-score-based merging into the language tokens, then language-guided extraction from search frames) is presented as producing useful tracking representations from unlabeled video plus language alone, yet no initialization, auxiliary loss, or validation mechanism is described that would ensure early attention maps reliably highlight the referred instance rather than distractors or background. This assumption is load-bearing for the self-supervised loop to function as claimed.
minor comments (3)
  1. [Title] The title contains 'Nature Language' which should read 'Natural Language'.
  2. [Abstract] The abstract's claim that SVLTrack 'surpasses SOTA self-supervised methods' would be more informative if it referenced specific benchmarks, metrics, or a results table.
  3. [Abstract] LaTeX artifacts such as '{' and '}' around tracker references should be cleaned for readability in the final version.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address the major comment on the self-supervised mechanism below and commit to improving the clarity of the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Dynamic Token Aggregation Module] The three-step process (anchor-based selection, attention-score-based merging into the language tokens, then language-guided extraction from search frames) is presented as producing useful tracking representations from unlabeled video plus language alone, yet no initialization, auxiliary loss, or validation mechanism is described that would ensure early attention maps reliably highlight the referred instance rather than distractors or background. This assumption is load-bearing for the self-supervised loop to function as claimed.

    Authors: We appreciate the referee highlighting this foundational assumption. In our approach, the language description directly initializes the process: cross-attention is computed between the language tokens and all visual tokens in the template frame to identify and select the anchor token corresponding to the referred instance. This language-guided selection provides the starting bias for the attention maps without requiring bounding-box annotations, auxiliary losses, or separate validation steps. The subsequent merging and propagation steps then reinforce this alignment through temporal consistency across unlabeled frames. We agree the current description in the abstract and method section is insufficiently explicit on this initialization. We will revise the manuscript to elaborate on the language-to-visual attention for anchor selection, add a dedicated paragraph on the bootstrapping mechanism, and include attention map visualizations from early training stages to demonstrate reliable focus on the target instance. revision: partial
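A minimal sketch of the bootstrapping step the response describes, under the same assumptions as the earlier sketches; only the language-to-template cross-attention is stated in the rebuttal, so the function name and the argmax readout are hypothetical.

    import torch

    def select_anchor(lang_tokens, template_tokens):
        # Cross-attention scores between every language token and every
        # template visual token, as the rebuttal describes.
        d = lang_tokens.shape[-1]
        attn = (lang_tokens @ template_tokens.T) / d**0.5    # (L, Nt)
        # Pool over language tokens and take the most-attended visual token
        # as the anchor for the referred instance (assumed readout).
        saliency = attn.softmax(dim=-1).mean(dim=0)          # (Nt,)
        return template_tokens[saliency.argmax()]            # (d,)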

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained

full rationale

The paper introduces SVLTrack via a novel Dynamic Token Aggregation Module whose three steps (anchor-based token selection, attention-score merging into language tokens, and language-guided extraction) are presented as an independent architectural choice enabling self-supervised learning from unlabeled video plus language descriptions. No equations, loss definitions, or claimed predictions in the abstract or described method reduce by construction to fitted inputs, self-citations, or renamed prior results; the bootstrap relies on the proposed attention mechanism itself rather than presupposing the target output. The central claim of surpassing SOTA self-supervised methods is positioned as empirically validated rather than definitionally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard transformer attention assumptions but introduces no new free parameters or entities explicitly in the abstract.

axioms (1)
  • domain assumption: The attention mechanism in transformers can be used to select important visual tokens based on language guidance.
    Implicit in the token aggregation step described in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1337 out tokens · 33924 ms · 2026-05-11T01:40:41.137346+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages

  1. [1]

    Transformer tracking

    Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In CVPR, pages 8126–8135, 2021.

  2. [2]

    Mixformer: End-to-end tracking with iterative mixed attention

    Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In CVPR, pages 13598–13608, 2022.

  3. [3]

    Unbiased missing-modality multimodal learning

    Ruiting Dai, Chenxi Li, Yandong Yan, Lisi Mo, Ke Qin, and Tao He. Unbiased missing-modality multimodal learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24507–24517, 2025.

  4. [4]

    ECO: efficient convolution operators for tracking

    Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: efficient convolution operators for tracking. In CVPR, pages 6931–6939, 2017.

  5. [5]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.

  6. [6]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  7. [7]

    Lasot: A high-quality benchmark for large-scale single object tracking

    Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In CVPR, pages 5374–5383, 2019.

  8. [8]

    Lasot: A high-quality large-scale single object tracking benchmark

    Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Lin Yuan, and Haibin Ling. Lasot: A high-quality large-scale single object tracking benchmark. Int. J. Comput. Vis., pages 439–461, 2021.

  9. [9]

    Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers

    Qi Feng, Vitaly Ablavsky, Qinxun Bai, and Stan Sclaroff. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In CVPR, pages 5851–5860, 2021.

  10. [10]

    Memvlt: Vision-language tracking with adaptive memory-based prompts

    Xiaokun Feng, Xuchen Li, Shiyu Hu, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, and Kaiqi Huang. Memvlt: Vision-language tracking with adaptive memory-based prompts. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  11. [11]

    Rethinking obscured sub-optimality in analytic learning for exemplar-free class-incremental learning

    Zijian Gao, Kele Xu, Xingxing Zhang, Huiping Zhuang, Tianjiao Wan, Bo Ding, Xinjun Mao, and Wang Huaimin. Rethinking obscured sub-optimality in analytic learning for exemplar-free class-incremental learning. IEEE Transactions on Circuits and Systems for Video Technology, 36(10):1123–1136, 2025.

  12. [12]

    Consistencies are all you need for semi-supervised vision-language tracking

    Jiawei Ge, Jiuxin Cao, Xuelin Zhu, Xinyu Zhang, Chang Liu, Kun Wang, and Bo Liu. Consistencies are all you need for semi-supervised vision-language tracking. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 1895–1904, 2024.

  13. [13]

    A survey of wireless sensing security from a role-based view

    Ruixu Geng, Jianyang Wang, Yuqin Yuan, Fengquan Zhan, Tianyu Zhang, Rui Zhang, Pengcheng Huang, Dongheng Zhang, Jinbo Chen, Yang Hu, et al. A survey of wireless sensing security from a role-based view. IEEE Communications Surveys & Tutorials, 2025.

  14. [14]

    Divert more attention to vision-language tracking

    Mingzhe Guo, Zhipeng Zhang, Heng Fan, and Liping Jing. Divert more attention to vision-language tracking. NIPS, abs/2207.01076, 2022.

  15. [15]

    Segment concealed object with incomplete supervision

    Chunming He, Kai Li, Yachao Zhang, Ziyun Yang, Longxiang Tang, Yulun Zhang, Linghe Kong, and Sina Farsiu. Segment concealed object with incomplete supervision. TPAMI.

  16. [16]

    Reversible unfolding network for concealed visual perception with generative refinement

    Chunming He, Fengyang Xiao, Rihan Zhang, Chengyu Fang, Deng-Ping Fan, and Sina Farsiu. Reversible unfolding network for concealed visual perception with generative refinement. arXiv preprint arXiv:2508.15027, 2025.

  17. [17]

    Masked video pretraining advances real-world video denoising

    Yi Jin, Xiaoxiao Ma, Rui Zhang, Huaian Chen, Yuxuan Gu, Pengyang Ling, and Enhong Chen. Masked video pretraining advances real-world video denoising. IEEE Transactions on Multimedia, 27:622–636, 2025.

  18. [18]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

  19. [19]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.

  20. [20]

    Self-supervised tracking via target-aware data synthesis

    Xin Li, Wenjie Pei, Yaowei Wang, Zhenyu He, Huchuan Lu, and Ming-Hsuan Yang. Self-supervised tracking via target-aware data synthesis. IEEE Transactions on Neural Networks and Learning Systems, 2023.

  21. [21]

    Dtllm-vlt: Diverse text generation for visual language tracking based on llm

    Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, and Kaiqi Huang. Dtllm-vlt: Diverse text generation for visual language tracking based on llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7283–7292, 2024.

  22. [22]

    Dynamic updates for language adaptation in visual-language tracking

    Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, and Shuxiang Song. Dynamic updates for language adaptation in visual-language tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19165–19174, 2025.

  23. [23]

    Cross-modal augmentation for low-resource language understanding and generation

    Zichao Li and Zong Ke. Cross-modal augmentation for low-resource language understanding and generation. In Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025), pages 90–99, 2025.

  24. [24]

    Tracking by natural language specification

    Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Tracking by natural language specification. In CVPR, pages 7350–7358, 2017.

  25. [25]

    Progressive semantic-visual alignment and refinement for vision-language tracking

    Yanjie Liang, Qiangqiang Wu, Lin Cheng, Changqun Xia, and Jia Li. Progressive semantic-visual alignment and refinement for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 2024.

  26. [26]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.

  27. [27]

    Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation

    Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, and Feiyue Huang. Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6489–6498.

  28. [28]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

  29. [29]

    Tf-icon: Diffusion-based training-free cross-domain image composition

    Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. Tf-icon: Diffusion-based training-free cross-domain image composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2294–2305, 2023.

  30. [30]

    Mace: Mass concept erasure in diffusion models

    Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. Mace: Mass concept erasure in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6430–6440, 2024.

  31. [31]

    Unifying visual and vision-language tracking via contrastive learning

    Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, and Mengxue Kang. Unifying visual and vision-language tracking via contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4107–4116, 2024.

  32. [32]

    Towards realistic data generation for real-world super-resolution

    Long Peng, Wenbo Li, Renjing Pei, Jingjing Ren, Jiaqi Xu, Yang Wang, Yang Cao, and Zheng-Jun Zha. Towards realistic data generation for real-world super-resolution. In The Thirteenth International Conference on Learning Representations.

  33. [33]

    Lightweight adaptive feature de-drifting for compressed image classification

    Long Peng, Yang Cao, Yuejin Sun, and Yang Wang. Lightweight adaptive feature de-drifting for compressed image classification. IEEE Transactions on Multimedia, 26:6424–6436, 2024.

  34. [34]

    Convex optimization of markov decision processes based on z transform: A theoretical framework for two-space decomposition and linear programming reconstruction

    Shiqing Qiu, Haoyu Wang, Yuxin Zhang, Zong Ke, and Zichao Li. Convex optimization of markov decision processes based on z transform: A theoretical framework for two-space decomposition and linear programming reconstruction. Mathematics, 13(11):1765, 2025.

  35. [35]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian D. Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019.

  36. [36]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  37. [37]

    Context-aware integration of language and visual references for natural language tracking

    Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, and Jiming Chen. Context-aware integration of language and visual references for natural language tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19208–19217, 2024.

  38. [38]

    Unsupervised learning of accurate siamese tracking

    Qiuhong Shen, Lei Qiao, Jinyang Guo, Peixia Li, Xin Li, Bo Li, Weitao Feng, Weihao Gan, Wei Wu, and Wanli Ouyang. Unsupervised learning of accurate siamese tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8101–8110, 2022.

  39. [39]

    Aligning and prompting everything all at once for universal visual perception

    Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, and Rongrong Ji. Aligning and prompting everything all at once for universal visual perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13193–13203, 2024.

  40. [40]

    Transformer tracking with cyclic shifting window attention

    Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8791–8800.

  41. [41]

    Compact transformer tracker with correlative masked modeling

    Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. In Proceedings of the AAAI conference on artificial intelligence, pages 2321–2329, 2023.

  42. [42]

    Chattracker: Enhancing visual tracking performance via chatting with multimodal large language model

    Yiming Sun, Fan Yu, Shaoxiang Chen, Yu Zhang, Junwei Huang, Yang Li, Chenhui Li, and Changbo Wang. Chattracker: Enhancing visual tracking performance via chatting with multimodal large language model. Advances in Neural Information Processing Systems, 37:39303–39324, 2024.

  43. [43]

    Unsupervised deep tracking

    Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, and Houqiang Li. Unsupervised deep tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1308–1317, 2019.

  44. [44]

    Unsupervised deep representation learning for real-time tracking

    Ning Wang, Wengang Zhou, Yibing Song, Chao Ma, Wei Liu, and Houqiang Li. Unsupervised deep representation learning for real-time tracking. International Journal of Computer Vision, 129(2):400–418, 2021.

  45. [45]

    Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark

    Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In CVPR, pages 13763–13773, 2021.

  46. [46]

    Dropmae: Masked autoencoders with spatial-attention dropout for tracking tasks

    Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, and Antoni B. Chan. Dropmae: Masked autoencoders with spatial-attention dropout for tracking tasks. CVPR, abs/2304.00571, 2023.

  47. [47]

    Autoregressive visual tracking

    Wei Xing, Bai Yifan, Zheng Yongchao, Shi Dahu, and Gong Yihong. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9697–9706, 2023.

  48. [48]

    Learning spatio-temporal transformer for visual tracking

    Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In ICCV, pages 10428–10437, 2021.

  49. [49]

    Visa: Reasoning video object segmentation via large language models

    Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. In European Conference on Computer Vision, pages 98–115. Springer, 2025.

  50. [50]

    Joint feature learning and relation modeling for tracking: A one-stream framework

    Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In ECCV (22), pages 341–357, 2022.

  51. [51]

    Knowledge-aligned counterfactual-enhancement diffusion perception for unsupervised cross-domain visual emotion recognition

    Wen Yin, Yong Wang, Guiduo Duan, Dongyang Zhang, Xin Hu, Yuan-Fang Li, and Tao He. Knowledge-aligned counterfactual-enhancement diffusion perception for unsupervised cross-domain visual emotion recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3888–3898, 2025.

  52. [52]

    Self-supervised deep correlation tracking

    Di Yuan, Xiaojun Chang, Po-Yao Huang, Qiao Liu, and Zhenyu He. Self-supervised deep correlation tracking. IEEE Transactions on Image Processing, 30:976–985, 2020.

  53. [53]

    X2-vlm: All-in-one pre-trained model for vision-language tasks

    Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and Wangchunshu Zhou. X2-vlm: All-in-one pre-trained model for vision-language tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

  54. [54]

    All in one: Exploring unified vision-language tracking with multi-modal alignment

    Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, and Yanfeng Wang. All in one: Exploring unified vision-language tracking with multi-modal alignment. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5552–5561, 2023.

  55. [55]

    Rf-mamba: Frequency-aware state space model for rf-based human-centric perception

    Rui Zhang, Ruixu Geng, Yadong Li, Ruiyuan Song, Hanqin Gong, Dongheng Zhang, Yang Hu, and Yan Chen. Rf-mamba: Frequency-aware state space model for rf-based human-centric perception. In The Thirteenth International Conference on Learning Representations, 2025.

  56. [56]

    Diff-tracker: text-to-image diffusion models are unsupervised trackers

    Zhengbo Zhang, Li Xu, Duo Peng, Hossein Rahmani, and Jun Liu. Diff-tracker: text-to-image diffusion models are unsupervised trackers. In European Conference on Computer Vision, pages 319–337. Springer, 2025.

  57. [57]

    Learning to track objects from unlabeled videos

    Jilai Zheng, Chao Ma, Houwen Peng, and Xiaokang Yang. Learning to track objects from unlabeled videos. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13546–13555, 2021.

  58. [58]

    Leveraging local and global cues for visual tracking via parallel interaction network

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhenjun Tang, Rongrong Ji, and Xianxian Li. Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Transactions on Circuits and Systems for Video Technology, 33(4):1671–1683, 2022.

  59. [59]

    Toward unified token learning for vision-language tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Toward unified token learning for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023.

  60. [60]

    Odtrack: Online dense temporal token learning for visual tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. Odtrack: Online dense temporal token learning for visual tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7588–7596, 2024.

  61. [61]

    Decoupled spatio-temporal consistency learning for self-supervised tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, and Shuxiang Song. Decoupled spatio-temporal consistency learning for self-supervised tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10635–10643, 2025.

  62. [62]

    Towards universal modal tracking with online dense temporal token learning

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, and Rongrong Ji. Towards universal modal tracking with online dense temporal token learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  63. [63]

    Comem: Compositional concept-graph memory for vision–language adaptation

    Heng Zhou, Jing Tang, Yanshu Li, Canran Xiao, Liwei Hou, Zong Ke, Jiawei Yao, et al. Comem: Compositional concept-graph memory for vision–language adaptation. In The Fourteenth International Conference on Learning Representations, 2026.

  64. [64]

    Joint visual grounding and tracking with natural language specification

    Li Zhou, Zikun Zhou, Kaige Mao, and Zhenyu He. Joint visual grounding and tracking with natural language specification. CVPR, abs/2303.12027, 2023.

  65. [65]

    Tracking with human-intent reasoning

    Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning. arXiv preprint arXiv:2312.17448, 2023.
