Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning
Pith reviewed 2026-05-08 13:58 UTC · model grok-4.3
The pith
A dual-modal context association mechanism lets self-supervised trackers learn robust representations from unlabeled videos using prompts and noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a two-stage contextual association process—beginning with semantic instance patch token assignments and progressing to gradual contextual noise injection—drives the model to learn robust tracking representations from unlabeled data.
What carries the argument
The dual-modal context association mechanism, which jointly applies semantic prompts early and contextual noise later in an easy-to-hard progression during training only.
Load-bearing premise
That assigning instance patch tokens early and then gradually injecting contextual noise will reliably produce robust tracking representations without causing instability or mode collapse during self-supervised training.
What would settle it
Observing that models trained without the gradual noise injection stage perform no better than standard self-supervised trackers on tracking benchmarks from unlabeled videos.
Figures
read the original abstract
Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \textbf{\tracker}, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adherent to the easy-to-hard learning principle, our contextual association mechanism operates based on two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb feature, encouraging the tracker to learn robust tracking representations in a more complex feature space. Thus, this novel contextual association mechanism enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, while being applied exclusively during training to preserve efficient inference. Extensive experiments demonstrate the superiority of our method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a self-supervised visual tracking framework named Tracker that introduces a dual-modal context association mechanism jointly leveraging fine-grained semantic prompts and contextual noise. Adhering to an easy-to-hard curriculum, instance patch tokens are assigned to forward and backward tracking branches in early training to acquire basic knowledge, after which contextual noise is gradually injected into features to encourage robust representations from unlabeled videos. The mechanism is restricted to training time to preserve efficient inference, and the paper asserts that extensive experiments demonstrate superiority over existing approaches.
Significance. If the empirical claims hold, the work could meaningfully advance self-supervised tracking by addressing context modeling limitations in unlabeled settings through a prompt-plus-noise curriculum. The training-only application of the association mechanism is a clear practical strength for deployment. The approach extends curriculum learning ideas to tracking, which would be a useful addition if shown to avoid collapse while delivering measurable gains.
major comments (2)
- [Abstract] Abstract: The assertion that 'extensive experiments demonstrate the superiority of our method' is unsupported by any quantitative results, baselines, tables, figures, or error analysis in the manuscript, which is load-bearing because the central claim of learning high-quality tracking representations cannot be evaluated.
- [Method] Method description: The gradual contextual noise injection is described only qualitatively, with no details on the noise distribution, injection schedule, feature perturbation operator, or loss terms (e.g., variance regularization or stop-gradient) to prevent instability or mode collapse across forward/backward branches, which directly affects whether the easy-to-hard transition reliably produces robust representations.
minor comments (1)
- [Abstract] The framework is referred to as bolded Tracker without an expanded name or acronym definition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, indicating where revisions will be made to improve clarity and support for our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that 'extensive experiments demonstrate the superiority of our method' is unsupported by any quantitative results, baselines, tables, figures, or error analysis in the manuscript, which is load-bearing because the central claim of learning high-quality tracking representations cannot be evaluated.
Authors: We acknowledge the referee's point that the abstract's claim would benefit from more direct support. The full manuscript includes an Experiments section with quantitative evaluations on standard benchmarks, baseline comparisons, tables, figures, and analysis. To address this directly, we will revise the abstract to incorporate a concise reference to key empirical outcomes (e.g., performance gains on tracking benchmarks) while preserving its brevity. This change will make the central claim more immediately verifiable from the abstract itself. revision: yes
-
Referee: [Method] Method description: The gradual contextual noise injection is described only qualitatively, with no details on the noise distribution, injection schedule, feature perturbation operator, or loss terms (e.g., variance regularization or stop-gradient) to prevent instability or mode collapse across forward/backward branches, which directly affects whether the easy-to-hard transition reliably produces robust representations.
Authors: We agree that the current description of the noise injection process is insufficiently precise for full reproducibility and to demonstrate stability of the curriculum. In the revised manuscript, we will augment the Method section with explicit specifications: the noise follows a Gaussian distribution whose variance increases linearly according to a defined schedule (beginning after the initial prompt-only stage); the perturbation operator adds the noise to contextual feature tokens prior to the association module; and we incorporate a variance regularization term in the overall loss along with stop-gradient operations on the backward branch to mitigate mode collapse. We will also include an algorithmic outline of the dual-stage process. revision: yes
Circularity Check
No circularity: empirical method with no derivations or self-referential fits
full rationale
The paper proposes an empirical self-supervised tracking framework based on a dual-modal context association mechanism with two training stages (early instance patch token assignment followed by gradual contextual noise injection). No mathematical derivations, equations, fitted parameters, or first-principles results are described that could reduce to the inputs by construction. The central claim rests on the training procedure enabling robust representations from unlabeled video, with superiority asserted via experiments rather than any self-citation chain, uniqueness theorem, or renaming of known results. The method is self-contained as a practical training recipe without load-bearing circular steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Ar- trackv2: Prompting autoregressive tracker where to look and how to describe
Yifan Bai, Zeyang Zhao, Yihong Gong, and Xing Wei. Ar- trackv2: Prompting autoregressive tracker where to look and how to describe. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition , pages 19048–19057, 2024. 2, 6
2024
-
[2]
Learning discriminative model prediction for track- ing
Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for track- ing. In ICCV, pages 6181–6190, 2019. 6
2019
-
[3]
Hiptrack: Visual tracking with historical prompts
Wenrui Cai, Qingjie Liu, and Yunhong Wang. Hiptrack: Visual tracking with historical prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19258–19267, 2024. 6
2024
-
[4]
Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking
Wenrui Cai, Qingjie Liu, and Yunhong Wang. Spmtrack: Spatio-temporal parameter-efficient fine-tuning with mixture of experts for scalable visual tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16871–16881, 2025. 1
2025
-
[5]
End- to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-end object detection with transformers. In ECCV, pages 213–229, 2020. 4
2020
-
[6]
Transformer tracking
Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8126–8135, 2021. 3, 6
2021
-
[7]
arXiv preprint arXiv:2304.14394 , year=
Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual ob- ject tracking. CVPR, abs/2304.14394, 2023. 6, 7
-
[8]
Mixformer: End-to-end tracking with iterative mixed atten- tion
Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed atten- tion. In CVPR, pages 13598–13608, 2022. 6, 7
2022
-
[9]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 4, 5
2021
-
[10]
Lasot: A high-quality benchmark for large-scale single ob- ject tracking
Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single ob- ject tracking. In CVPR, pages 5374–5383, 2019. 1, 2, 5, 6
2019
-
[11]
Lasot: A high-quality large-scale single object tracking benchmark
Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Lin Yuan, and Haibin Ling. Lasot: A high-quality large-scale single object tracking benchmark. Int. J. Comput. Vis., pages 439–461, 2021. 6
2021
-
[12]
Dreamtrack: Dreaming the future for multi- modal visual object tracking
Mingzhe Guo, Weiping Tan, Wenyu Ran, Liping Jing, and Zhipeng Zhang. Dreamtrack: Dreaming the future for multi- modal visual object tracking. In Proceedings of the Com- puter Vision and Pattern Recognition Conference , pages 7201–7210, 2025. 2, 6
2025
-
[13]
Got-10k: A large high-diversity benchmark for generic object tracking in the wild
Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell., 43(5): 1562–1577, 2021. 1, 2, 5
2021
-
[14]
Large- kernel spatially parallel feature fusion for monocular 3d per- ception in autonomous driving
Ruanzhi Jiao, Jinlai Zhang, Chang Li, and Lin Hu. Large- kernel spatially parallel feature fusion for monocular 3d per- ception in autonomous driving. Knowledge-Based Systems, 343:115998, 2026. 11
2026
-
[15]
Exploring enhanced contextual information for video-level object tracking
Ben Kang, Xin Chen, Simiao Lai, Yang Liu, Yi Liu, and Dong Wang. Exploring enhanced contextual information for video-level object tracking. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 4194–4202, 2025. 6
2025
-
[16]
Detection of ai deepfake and fraud in on- line payments using gan-based models
Zong Ke, Shicheng Zhou, Yining Zhou, Chia Hong Chang, and Rong Zhang. Detection of ai deepfake and fraud in on- line payments using gan-based models. In 2025 8th Inter- national Conference on Advanced Algorithms and Control Engineering (ICAACE), pages 1786–1790. IEEE, 2025. 11
2025
-
[17]
The eighth visual object tracking VOT2020 challenge results
Matej Kristan, Ales Leonardis, and et.al. The eighth visual object tracking VOT2020 challenge results. In ECCV Work- shops (5), pages 547–601. Springer, 2020. 6
2020
-
[18]
Siamrpn++: Evolution of siamese visual tracking with very deep networks
Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In CVPR, pages 4282– 4291, 2019. 3
2019
-
[19]
Self-supervised tracking via target- aware data synthesis
Xin Li, Wenjie Pei, Yaowei Wang, Zhenyu He, Huchuan Lu, and Ming-Hsuan Yang. Self-supervised tracking via target- aware data synthesis. IEEE Transactions on Neural Net- works and Learning Systems, 2023. 1, 3, 6
2023
-
[20]
Domain meets typology: Predict- ing verb-final order from universal dependencies for finan- cial and blockchain nlp
Zichao Li and Zong Ke. Domain meets typology: Predict- ing verb-final order from universal dependencies for finan- cial and blockchain nlp. In Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Mul- tilingual NLP, pages 156–164, 2025. 11
2025
-
[21]
Retrack: Evidence-driven dual-stream directional anchor calibration network for com- posed video retrieval
Zixu Li, Yupeng Hu, Zhiwei Chen, Qinlei Huang, Guozhi Qiu, Zhiheng Fu, and Meng Liu. Retrack: Evidence-driven dual-stream directional anchor calibration network for com- posed video retrieval. In Proceedings of the AAAI Confer- ence on Artificial Intelligence, pages 23373–23381, 2026
2026
-
[22]
Conesep: Cone-based robust noise- unlearning compositional network for composed image re- trieval, 2026
Zixu Li, Yupeng Hu, Zhiwei Chen, Mingyu Zhang, Zhiheng Fu, and Liqiang Nie. Conesep: Cone-based robust noise- unlearning compositional network for composed image re- trieval, 2026. 11
2026
-
[23]
Au- toregressive sequential pretraining for visual tracking
Shiyi Liang, Yifan Bai, Yihong Gong, and Xing Wei. Au- toregressive sequential pretraining for visual tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7254–7264, 2025. 6
2025
-
[24]
Tracking meets lora: Faster training, larger model, stronger performance
Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, and Haibin Ling. Tracking meets lora: Faster training, larger model, stronger performance. In European Confer- ence on Computer Vision , pages 300–318. Springer, 2024. 6
2024
-
[25]
Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014. 5
2014
-
[26]
Girshick, Kaiming He, and Piotr Doll ´ar
Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Doll ´ar. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017. 5
2017
-
[27]
A benchmark and simulator for UA V tracking
Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for UA V tracking. InECCV, pages 445–461, 2016. 6
2016
-
[28]
Trackingnet: A large-scale dataset and benchmark for object tracking in the wild
Matthias M ¨uller, Adel Bibi, Silvio Giancola, Salman Al- Subaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV, pages 310–327, 2018. 1, 2, 5
2018
-
[29]
Multi-resolution context augmentation and dual chan- nel attention for 3d lane detection
Qirui Ning, Jinlai Zhang, Yuhang Xie, Kaifeng Liu, Kai Gao, Bin Chen, Gengbiao Chen, Qing Fan, Hui Liu, and Ronghua Du. Multi-resolution context augmentation and dual chan- nel attention for 3d lane detection. IEEE Internet of Things Journal, 2025. 11
2025
-
[30]
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 2
work page internal anchor Pith review arXiv 2018
-
[31]
Reid, and Silvio Savarese
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian D. Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019. 5
2019
-
[32]
Unsupervised learning of accurate siamese tracking
Qiuhong Shen, Lei Qiao, Jinyang Guo, Peixia Li, Xin Li, Bo Li, Weitao Feng, Weihao Gan, Wei Wu, and Wanli Ouyang. Unsupervised learning of accurate siamese tracking. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8101–8110, 2022. 6
2022
-
[33]
S2siamfc: Self-supervised fully convolutional siamese network for visual tracking
Chon Hou Sio, Yu-Jen Ma, Hong-Han Shuai, Jun-Cheng Chen, and Wen-Huang Cheng. S2siamfc: Self-supervised fully convolutional siamese network for visual tracking. In Proceedings of the 28th ACM international conference on multimedia, pages 1948–1957, 2020. 3
1948
-
[34]
Gomez, Lukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998– 6008, 2017. 4
2017
-
[35]
Unsupervised deep representation learning for real-time tracking
Ning Wang, Wengang Zhou, Yibing Song, Chao Ma, Wei Liu, and Houqiang Li. Unsupervised deep representation learning for real-time tracking. International Journal of Computer Vision, 129(2):400–418, 2021. 6
2021
-
[36]
Transformer meets tracker: Exploiting temporal context for robust visual tracking
Ning Wang, Wengang Zhou, Jie Wang, and Houqiang Li. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In CVPR, pages 1571–1580, 2021. 6
2021
-
[37]
Learning correspondence from the cycle-consistency of time
Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2566–2576, 2019. 2
2019
-
[38]
Towards more flexible and accurate object tracking with natural language: Algo- rithms and benchmark
Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algo- rithms and benchmark. In CVPR, pages 13763–13773, 2021. 6
2021
-
[39]
Multi-scale convolution and dynamic task interaction detection head for efficient lightweight plum detection
Jiachun Wu, Jinlai Zhang, Jihong Zhu, Yijian Duan, Youyang Fang, Jingyu Zhu, Lairong Yin, Jiahui Jiang, Zhiyong He, Yi Huang, et al. Multi-scale convolution and dynamic task interaction detection head for efficient lightweight plum detection. Food and Bioproducts Process- ing, 149:353–367, 2025. 11
2025
- [40]
-
[41]
Spatiotemporal multi-view con- tinual dictionary learning with graph diffusion
Sheng Wu and Jinlai Zhang. Spatiotemporal multi-view con- tinual dictionary learning with graph diffusion. Knowledge- Based Systems, 316:113388, 2025. 11
2025
-
[42]
Object track- ing benchmark
Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object track- ing benchmark. IEEE Trans. Pattern Anal. Mach. Intell., 37 (9):1834–1848, 2015. 6
2015
-
[43]
Correlation-aware deep tracking
Fei Xie, Chunyu Wang, Guangting Wang, Yue Cao, Wankou Yang, and Wenjun Zeng. Correlation-aware deep tracking. In CVPR, pages 8741–8750, 2022. 7
2022
-
[44]
Autore- gressive queries for adaptive tracking with spatio-temporal transformers
Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, and Rongrong Ji. Autore- gressive queries for adaptive tracking with spatio-temporal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19300– 19309, 2024. 2, 6
2024
-
[45]
Autoregressive visual tracking
Wei Xing, Bai Yifan, Zheng Yongchao, Shi Dahu, and Gong Yihong. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9697–9706, 2023. 2, 6
2023
-
[46]
Less is more: Token context-aware learning for object tracking
Chenlong Xu, Bineng Zhong, Qihua Liang, Yaozong Zheng, Guorong Li, and Shuxiang Song. Less is more: Token context-aware learning for object tracking. arXiv preprint arXiv:2501.00758, 2025. 6, 7
-
[47]
Learning spatio-temporal transformer for vi- sual tracking
Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for vi- sual tracking. In ICCV, pages 10428–10437, 2021. 6, 7
2021
-
[48]
Alpha-refine: Boosting tracking performance by precise bounding box estimation
Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, and Xi- aoyun Yang. Alpha-refine: Boosting tracking performance by precise bounding box estimation. In CVPR, pages 5289–
-
[49]
Computer Vision Foundation / IEEE, 2021. 7
2021
-
[50]
Joint feature learning and relation modeling for tracking: A one-stream framework
Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In ECCV (22), pages 341–357, 2022. 1, 3, 5, 6
2022
-
[51]
Identifying money laundering risks in digital as- set transactions based on ai algorithms
Qian Yu, Zong Ke, Guofu Xiong, Yu Cheng, and Xiao- jun Guo. Identifying money laundering risks in digital as- set transactions based on ai algorithms. In 2024 4th Inter- national Conference on Electronic Information Engineering and Computer Communication (EIECC), pages 1081–1085. IEEE, 2024. 11
2024
-
[52]
Self-supervised deep correlation tracking
Di Yuan, Xiaojun Chang, Po-Yao Huang, Qiao Liu, and Zhenyu He. Self-supervised deep correlation tracking. IEEE Transactions on Image Processing, 30:976–985, 2020. 2, 6
2020
-
[53]
Self- supervised object tracking with cycle-consistent siamese net- works
Weihao Yuan, Michael Yu Wang, and Qifeng Chen. Self- supervised object tracking with cycle-consistent siamese net- works. In 2020 IEEE/RSJ International Conference on In- telligent Robots and Systems (IROS) , pages 10351–10358. IEEE, 2020. 2
2020
-
[54]
Adaptive dual cross-attention network for multispectral object detection in autonomous driving
Jinlai Zhang, Xiaolong Song, Yucheng Li, Diqing Liang, Zhiyong Zhang, and Jinhu Cai. Adaptive dual cross-attention network for multispectral object detection in autonomous driving. Expert Systems with Applications , page 132012,
-
[55]
Multivariate feature learning and as- sociative spatial information enhancement for snow object detection in autonomous driving
Jinlai Zhang, Mingchao Xiang, Yongheng Hu, Wei Hao, Lin- long Lei, and Kefu Yi. Multivariate feature learning and as- sociative spatial information enhancement for snow object detection in autonomous driving. Engineering Applications of Artificial Intelligence, 175:114672, 2026. 11
2026
-
[56]
Ocean: Object-aware anchor-free tracking
Zhipeng Zhang, Houwen Peng, Jianlong Fu, Bing Li, and Weiming Hu. Ocean: Object-aware anchor-free tracking. In ECCV, pages 771–787, 2020. 7
2020
-
[57]
Diff-tracker: text-to-image diffusion models are un- supervised trackers
Zhengbo Zhang, Li Xu, Duo Peng, Hossein Rahmani, and Jun Liu. Diff-tracker: text-to-image diffusion models are un- supervised trackers. In European Conference on Computer Vision, pages 319–337. Springer, 2025. 6
2025
-
[58]
Learning to track objects from unlabeled videos
Jilai Zheng, Chao Ma, Houwen Peng, and Xiaokang Yang. Learning to track objects from unlabeled videos. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 13546–13555, 2021. 6
2021
-
[59]
Leveraging local and global cues for visual tracking via parallel interaction network
Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhenjun Tang, Rongrong Ji, and Xianxian Li. Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Transactions on Circuits and Systems for Video Tech- nology, 33(4):1671–1683, 2022. 2
2022
-
[60]
Toward unified token learning for vision-language tracking
Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Toward unified token learning for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023. 2
2023
-
[61]
Odtrack: Online dense temporal token learning for visual tracking
Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. Odtrack: Online dense temporal token learning for visual tracking. In Proceed- ings of the AAAI Conference on Artificial Intelligence, pages 7588–7596, 2024. 1, 2, 6, 7
2024
-
[62]
Decoupled spatio-temporal consistency learning for self-supervised tracking
Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, and Shuxiang Song. Decoupled spatio-temporal consistency learning for self-supervised tracking. In Proceedings of the AAAI Conference on Artificial Intelligence , pages 10635– 10643, 2025. 2, 3, 5, 6, 7
2025
-
[63]
Towards universal modal tracking with online dense temporal token learning
Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, and Rongrong Ji. Towards universal modal tracking with online dense temporal token learning. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 2025. 2
2025
-
[64]
Appendix Additional Backgrounds. With the advancement of deep learning techniques [14, 16, 20–22, 29, 39, 41, 50, 53, 54] and the potential to eliminate the need for large-scale la- beled data, self-supervised tracking has attracted increasing attention from researchers. Taking advantage of intrinsic correlations in unlabeled video data, such as temporal ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.