Recognition: no theorem link
Learning to Track Instance from Single Nature Language Description
Pith reviewed 2026-05-11 01:40 UTC · model grok-4.3
The pith
A Dynamic Token Aggregation Module enables self-supervised tracking of any object referred to by natural language in video without bounding box labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tracker demonstrates that self-supervised VL tracking can be achieved by treating visual tokens unequally: (i) selecting multiple important target tokens from the template frame using an anchor token; (ii) merging them according to attention scores and aggregating them into the language tokens, eliminating redundant visual-token noise and enhancing semantic alignment; and (iii) using the fused language tokens as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames.
What carries the argument
Dynamic Token Aggregation Module, which selects key visual tokens from the template using an anchor and attention scores, merges them into language tokens, and uses the result to guide target extraction and temporal propagation in search frames.
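The three-step flow can be sketched in plain numpy. Everything below — the mean-pooled anchor, the top-k cutoff, the additive fusion into language tokens — is an illustrative assumption about how such a module might work, not the paper's exact design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_token_aggregation(template_tokens, language_tokens, search_tokens, k=4):
    """Illustrative sketch of the module's three steps.

    template_tokens: (Nt, d) visual tokens from the template frame
    language_tokens: (Nl, d) text tokens from the description
    search_tokens:   (Ns, d) visual tokens from the search frame
    The shapes, top-k selection, and mean-pooled anchor are assumptions.
    """
    d = template_tokens.shape[1]

    # Step i: derive an anchor from the language tokens, score the
    # template tokens against it, and keep the top-k as target tokens.
    anchor = language_tokens.mean(axis=0)                        # (d,)
    scores = softmax(template_tokens @ anchor / np.sqrt(d))      # (Nt,)
    top_k = np.argsort(scores)[-k:]
    target_tokens, target_scores = template_tokens[top_k], scores[top_k]

    # Step ii: merge the selected tokens weighted by their attention
    # scores, then aggregate the merged vector into the language tokens.
    merged = (target_scores / target_scores.sum()) @ target_tokens   # (d,)
    fused_language = language_tokens + merged                        # (Nl, d)

    # Step iii: the fused language tokens score search-frame tokens;
    # these weights would guide extraction and temporal propagation.
    guide = fused_language.mean(axis=0)
    search_scores = softmax(search_tokens @ guide / np.sqrt(d))      # (Ns,)
    return fused_language, search_scores
```

The sketch makes the module's asymmetry concrete: only the k highest-scoring template tokens ever touch the language tokens, so background tokens are discarded before fusion rather than averaged in.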
If this is right
- Trackers can be trained on large-scale unlabeled video data guided only by natural language descriptions.
- Semantic alignment between vision and language improves by discarding redundant visual token noise through selective aggregation.
- Temporal consistency across frames strengthens when fused language tokens serve as prompts for subsequent search frames.
- Self-supervised VL tracking can outperform methods that fuse all language and visual tokens equally.
Where Pith is reading between the lines
- The selective token approach may extend to other video-language tasks such as action localization or video question answering to reduce reliance on dense annotations.
- Training at internet-video scale becomes feasible if attention-based selection proves robust across diverse domains without box labels.
- The mechanism could yield more interpretable trackers by revealing which visual tokens language descriptions activate at each step.
Load-bearing premise
The Dynamic Token Aggregation Module can effectively select and aggregate tokens to enhance semantic alignment and temporal prompts without any supervision from bounding boxes, relying only on the attention scores and language guidance.
What would settle it
An experiment on VL tracking benchmarks in which Tracker does not surpass existing self-supervised methods, or a qualitative case where the tokens selected and aggregated by the module fail to match the object referred to in the language description.
Original abstract
How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce Tracker, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that Tracker surpasses SOTA self-supervised methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Tracker, a self-supervised vision-language tracker capable of tracking any object referred to by a natural language description in a video sequence without any bounding-box annotations. It proposes a Dynamic Token Aggregation Module consisting of three steps: anchor-based selection of target tokens from the template frame, merging selected tokens into language tokens via attention scores to enhance semantic alignment, and using the fused tokens as guidance to extract and propagate target tokens across search frames for temporal prompts. The approach enables self-supervised learning of language-guided tracking representations, with claims of surpassing state-of-the-art self-supervised methods on VL tracking benchmarks.
Significance. If the self-supervised mechanism proves effective, the work would offer a meaningful reduction in annotation costs for vision-language tracking by demonstrating that language descriptions alone can bootstrap instance tracking representations from unlabeled video data.
major comments (1)
- [Abstract and Dynamic Token Aggregation Module] Abstract and method description of the Dynamic Token Aggregation Module: the three-step process (anchor-based selection, attention-score-based merging into language tokens, then language-guided extraction from search frames) is presented as producing useful tracking representations from unlabeled video plus language only, but no initialization, auxiliary loss, or validation mechanism is described to ensure that early attention maps reliably highlight the referred instance rather than distractors or background. This assumption is load-bearing for the self-supervised loop to function as claimed.
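For concreteness, the missing validation mechanism the comment asks about could be as simple as an entropy check on the attention map before trusting it for anchor selection. This is a hypothetical sketch of such a check, not something the paper describes:

```python
import numpy as np

def attention_confidence(attn, entropy_frac=0.5):
    """Hypothetical sanity check (not from the paper): flag attention
    maps that are too diffuse to trust for anchor selection.

    attn: (N,) non-negative attention weights over visual tokens.
    Returns (normalized_entropy, is_confident). Normalized entropy near
    1.0 means near-uniform attention, i.e. no clear target instance;
    the 0.5 threshold is an arbitrary illustrative choice.
    """
    p = attn / attn.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    normalized = entropy / np.log(len(p))   # scaled into [0, 1]
    return normalized, bool(normalized < entropy_frac)
```

A self-supervised loop could skip or down-weight frames whose maps fail such a check, which is one way to address the distractor-vs-target concern without box labels.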
minor comments (3)
- [Title] The title contains 'Nature Language' which should read 'Natural Language'.
- [Abstract] The abstract's claim that Tracker 'surpasses SOTA self-supervised methods' would be more informative if it referenced specific benchmarks, metrics, or a results table.
- [Abstract] LaTeX artifacts such as '{' and '}' around tracker references should be cleaned for readability in the final version.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address the major comment on the self-supervised mechanism below and commit to improving the clarity of the manuscript.
Point-by-point responses
Referee: [Abstract and Dynamic Token Aggregation Module] Abstract and method description of the Dynamic Token Aggregation Module: the three-step process (anchor-based selection, attention-score-based merging into language tokens, then language-guided extraction from search frames) is presented as producing useful tracking representations from unlabeled video plus language only, but no initialization, auxiliary loss, or validation mechanism is described to ensure that early attention maps reliably highlight the referred instance rather than distractors or background. This assumption is load-bearing for the self-supervised loop to function as claimed.
Authors: We appreciate the referee highlighting this foundational assumption. In our approach, the language description directly initializes the process: cross-attention is computed between the language tokens and all visual tokens in the template frame to identify and select the anchor token corresponding to the referred instance. This language-guided selection provides the starting bias for the attention maps without requiring bounding-box annotations, auxiliary losses, or separate validation steps. The subsequent merging and propagation steps then reinforce this alignment through temporal consistency across unlabeled frames. We agree the current description in the abstract and method section is insufficiently explicit on this initialization. We will revise the manuscript to elaborate on the language-to-visual attention for anchor selection, add a dedicated paragraph on the bootstrapping mechanism, and include attention map visualizations from early training stages to demonstrate reliable focus on the target instance.
Revision: partial
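The language-to-visual anchor initialization the authors describe could be sketched as follows; the softmax normalization and the pick-the-token-with-highest-aggregate-attention rule are assumptions made here for illustration:

```python
import numpy as np

def select_anchor(language_tokens, template_tokens):
    """Sketch of language-guided anchor initialization: cross-attention
    from every text token to all template visual tokens, then selecting
    the visual token that receives the most total attention mass.

    language_tokens: (Nl, d), template_tokens: (Nt, d).
    Returns (anchor_index, anchor_token).
    """
    d = template_tokens.shape[1]
    logits = language_tokens @ template_tokens.T / np.sqrt(d)  # (Nl, Nt)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)                    # each row sums to 1
    mass = attn.sum(axis=0)                                    # (Nt,) attention per visual token
    anchor_idx = int(mass.argmax())
    return anchor_idx, template_tokens[anchor_idx]
```

Under this reading, the "starting bias" in the rebuttal is simply that the visual token most aligned with the description wins the argmax, with no box supervision anywhere in the loop.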
Circularity Check
No significant circularity detected; derivation is self-contained
Full rationale
The paper introduces Tracker via a novel Dynamic Token Aggregation Module whose three steps (anchor-based token selection, attention-score merging into language tokens, and language-guided extraction) are presented as an independent architectural choice enabling self-supervised learning from unlabeled video plus language descriptions. No equations, loss definitions, or claimed predictions in the abstract or described method reduce by construction to fitted inputs, self-citations, or renamed prior results; the bootstrap relies on the proposed attention mechanism itself rather than presupposing the target output. The central claim of surpassing SOTA self-supervised methods is positioned as empirically validated rather than definitionally forced.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The attention mechanism in transformers can be used to select important visual tokens based on language guidance.
Reference graph
Works this paper leans on
- [1] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In CVPR, pages 8126–8135, 2021.
- [2] Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In CVPR, pages 13598–13608, 2022.
- [3] Ruiting Dai, Chenxi Li, Yandong Yan, Lisi Mo, Ke Qin, and Tao He. Unbiased missing-modality multimodal learning. In ICCV, pages 24507–24517, 2025.
- [4] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: Efficient convolution operators for tracking. In CVPR, pages 6931–6939, 2017.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, volume 1, pages 4171–4186, 2019.
- [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- [7] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. LaSOT: A high-quality benchmark for large-scale single object tracking. In CVPR, pages 5374–5383, 2019.
- [8] Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Lin Yuan, and Haibin Ling. LaSOT: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision, pages 439–461, 2021.
- [9] Qi Feng, Vitaly Ablavsky, Qinxun Bai, and Stan Sclaroff. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In CVPR, pages 5851–5860, 2021.
- [10] Xiaokun Feng, Xuchen Li, Shiyu Hu, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, and Kaiqi Huang. MemVLT: Vision-language tracking with adaptive memory-based prompts. In NeurIPS, 2024.
- [11] Zijian Gao, Kele Xu, Xingxing Zhang, Huiping Zhuang, Tianjiao Wan, Bo Ding, Xinjun Mao, and Wang Huaimin. Rethinking obscured sub-optimality in analytic learning for exemplar-free class-incremental learning. IEEE Transactions on Circuits and Systems for Video Technology, 36(10):1123–1136, 2025.
- [12] Jiawei Ge, Jiuxin Cao, Xuelin Zhu, Xinyu Zhang, Chang Liu, Kun Wang, and Bo Liu. Consistencies are all you need for semi-supervised vision-language tracking. In ACM MM, pages 1895–1904, 2024.
- [13] Ruixu Geng, Jianyang Wang, Yuqin Yuan, Fengquan Zhan, Tianyu Zhang, Rui Zhang, Pengcheng Huang, Dongheng Zhang, Jinbo Chen, Yang Hu, et al. A survey of wireless sensing security from a role-based view. IEEE Communications Surveys & Tutorials, 2025.
- [14] Mingzhe Guo, Zhipeng Zhang, Heng Fan, and Liping Jing. Divert more attention to vision-language tracking. In NeurIPS, abs/2207.01076, 2022.
- [15] Chunming He, Kai Li, Yachao Zhang, Ziyun Yang, Longxiang Tang, Yulun Zhang, Linghe Kong, and Sina Farsiu. Segment concealed object with incomplete supervision. TPAMI.
- [16] Chunming He, Fengyang Xiao, Rihan Zhang, Chengyu Fang, Deng-Ping Fan, and Sina Farsiu. Reversible unfolding network for concealed visual perception with generative refinement. arXiv preprint arXiv:2508.15027, 2025.
- [17] Yi Jin, Xiaoxiao Ma, Rui Zhang, Huaian Chen, Yuxuan Gu, Pengyang Ling, and Enhong Chen. Masked video pretraining advances real-world video denoising. IEEE Transactions on Multimedia, 27:622–636, 2025.
- [18] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In CVPR, pages 9579–9589, 2024.
- [19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.
- [20] Xin Li, Wenjie Pei, Yaowei Wang, Zhenyu He, Huchuan Lu, and Ming-Hsuan Yang. Self-supervised tracking via target-aware data synthesis. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- [21] Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, and Kaiqi Huang. DTLLM-VLT: Diverse text generation for visual language tracking based on LLM. In CVPR, pages 7283–7292, 2024.
- [22] Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, and Shuxiang Song. Dynamic updates for language adaptation in visual-language tracking. In CVPR, pages 19165–19174, 2025.
- [23] Zichao Li and Zong Ke. Cross-modal augmentation for low-resource language understanding and generation. In Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025), pages 90–99, 2025.
- [24] Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Tracking by natural language specification. In CVPR, pages 7350–7358, 2017.
- [25] Yanjie Liang, Qiangqiang Wu, Lin Cheng, Changqun Xia, and Jia Li. Progressive semantic-visual alignment and refinement for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 2024.
- [26] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.
- [27] Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, and Feiyue Huang. Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation. In CVPR, pages 6489–6498.
- [28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
- [29] Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. TF-ICON: Diffusion-based training-free cross-domain image composition. In ICCV, pages 2294–2305, 2023.
- [30] Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. MACE: Mass concept erasure in diffusion models. In CVPR, pages 6430–6440, 2024.
- [31] Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, and Mengxue Kang. Unifying visual and vision-language tracking via contrastive learning. In AAAI, pages 4107–4116, 2024.
- [32] Long Peng, Wenbo Li, Renjing Pei, Jingjing Ren, Jiaqi Xu, Yang Wang, Yang Cao, and Zheng-Jun Zha. Towards realistic data generation for real-world super-resolution. In The Thirteenth International Conference on Learning Representations.
- [33] Long Peng, Yang Cao, Yuejin Sun, and Yang Wang. Lightweight adaptive feature de-drifting for compressed image classification. IEEE Transactions on Multimedia, 26:6424–6436, 2024.
- [34] Shiqing Qiu, Haoyu Wang, Yuxin Zhang, Zong Ke, and Zichao Li. Convex optimization of Markov decision processes based on Z transform: A theoretical framework for two-space decomposition and linear programming reconstruction. Mathematics, 13(11):1765, 2025.
- [35] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian D. Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019.
- [36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
- [37] Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, and Jiming Chen. Context-aware integration of language and visual references for natural language tracking. In CVPR, pages 19208–19217, 2024.
- [38] Qiuhong Shen, Lei Qiao, Jinyang Guo, Peixia Li, Xin Li, Bo Li, Weitao Feng, Weihao Gan, Wei Wu, and Wanli Ouyang. Unsupervised learning of accurate siamese tracking. In CVPR, pages 8101–8110, 2022.
- [39] Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, and Rongrong Ji. Aligning and prompting everything all at once for universal visual perception. In CVPR, pages 13193–13203, 2024.
- [40] Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In CVPR, pages 8791–8800.
- [41] Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. In AAAI, pages 2321–2329, 2023.
- [42] Yiming Sun, Fan Yu, Shaoxiang Chen, Yu Zhang, Junwei Huang, Yang Li, Chenhui Li, and Changbo Wang. ChatTracker: Enhancing visual tracking performance via chatting with multimodal large language model. Advances in Neural Information Processing Systems, 37:39303–39324, 2024.
- [43] Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, and Houqiang Li. Unsupervised deep tracking. In CVPR, pages 1308–1317, 2019.
- [44] Ning Wang, Wengang Zhou, Yibing Song, Chao Ma, Wei Liu, and Houqiang Li. Unsupervised deep representation learning for real-time tracking. International Journal of Computer Vision, 129(2):400–418, 2021.
- [45] Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In CVPR, pages 13763–13773, 2021.
- [46]
- [47] Wei Xing, Bai Yifan, Zheng Yongchao, Shi Dahu, and Gong Yihong. Autoregressive visual tracking. In CVPR, pages 9697–9706, 2023.
- [48] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In ICCV, pages 10428–10437, 2021.
- [49] Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. VISA: Reasoning video object segmentation via large language models. In ECCV, pages 98–115. Springer, 2025.
- [50] Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In ECCV (22), pages 341–357, 2022.
- [51] Wen Yin, Yong Wang, Guiduo Duan, Dongyang Zhang, Xin Hu, Yuan-Fang Li, and Tao He. Knowledge-aligned counterfactual-enhancement diffusion perception for unsupervised cross-domain visual emotion recognition. In CVPR, pages 3888–3898, 2025.
- [52] Di Yuan, Xiaojun Chang, Po-Yao Huang, Qiao Liu, and Zhenyu He. Self-supervised deep correlation tracking. IEEE Transactions on Image Processing, 30:976–985, 2020.
- [53] Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and Wangchunshu Zhou. X2-VLM: All-in-one pre-trained model for vision-language tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [54] Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, and Yanfeng Wang. All in one: Exploring unified vision-language tracking with multi-modal alignment. In ACM MM, pages 5552–5561, 2023.
- [55] Rui Zhang, Ruixu Geng, Yadong Li, Ruiyuan Song, Hanqin Gong, Dongheng Zhang, Yang Hu, and Yan Chen. RF-Mamba: Frequency-aware state space model for RF-based human-centric perception. In ICLR, 2025.
- [56] Zhengbo Zhang, Li Xu, Duo Peng, Hossein Rahmani, and Jun Liu. Diff-Tracker: Text-to-image diffusion models are unsupervised trackers. In ECCV, pages 319–337. Springer, 2025.
- [57] Jilai Zheng, Chao Ma, Houwen Peng, and Xiaokang Yang. Learning to track objects from unlabeled videos. In ICCV, pages 13546–13555, 2021.
- [58] Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhenjun Tang, Rongrong Ji, and Xianxian Li. Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Transactions on Circuits and Systems for Video Technology, 33(4):1671–1683, 2022.
- [59] Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Toward unified token learning for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023.
- [60] Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. ODTrack: Online dense temporal token learning for visual tracking. In AAAI, pages 7588–7596, 2024.
- [61] Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, and Shuxiang Song. Decoupled spatio-temporal consistency learning for self-supervised tracking. In AAAI, pages 10635–10643, 2025.
- [62] Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, and Rongrong Ji. Towards universal modal tracking with online dense temporal token learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [63] Heng Zhou, Jing Tang, Yanshu Li, Canran Xiao, Liwei Hou, Zong Ke, Jiawei Yao, et al. Comem: Compositional concept-graph memory for vision-language adaptation. In ICLR, 2026.
- [64] Li Zhou, Zikun Zhou, Kaige Mao, and Zhenyu He. Joint visual grounding and tracking with natural language specification. In CVPR, abs/2303.12027, 2023.
- [65] Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning. arXiv preprint arXiv:2312.17448, 2023.
- [66]

Appendix: Additional Backgrounds. With the advancement of deep learning techniques [3, 11, 13, 15–17, 23, 29, 30, 32–34, 40, 41, 51, 55, 63] and the potential to eliminate the need for large-scale labeled data, self-supervised tracking has attracted increasing attention from researchers. Taking advantage of intrinsic correlations in unlabeled video data, s...