Recognition: no theorem link
VPTracker: Global Vision-Language Tracking via Visual Prompt
Pith reviewed 2026-05-16 18:59 UTC · model grok-4.3
The pith
Multimodal language models perform global object tracking when given location-aware visual prompts to focus their search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VPTracker is the first global tracking framework based on Multimodal Large Language Models that exploits their powerful semantic reasoning to locate targets across the entire image space. A location-aware visual prompting mechanism constructs region-level prompts based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content.
What carries the argument
Location-aware visual prompting mechanism, which adds spatial priors from the previous target location to the MLLM input to prioritize region-level recognition and suppress similar-object interference.
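A minimal sketch of what such a region-level prompt construction could look like, assuming a simple box expansion around the previous location and a textual encoding of the prior region. The function names, the 2.0 expansion scale, and the prompt wording are illustrative assumptions, not the paper's implementation:

```python
def expand_box(box, scale, img_w, img_h):
    """Expand an (x1, y1, x2, y2) box around its center by `scale`, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0, cx - w / 2), max(0, cy - h / 2),
            min(img_w, cx + w / 2), min(img_h, cy + h / 2))

def build_prompt(description, prev_box, img_w, img_h, scale=2.0):
    """Compose a text prompt asking the MLLM to search the prior region first."""
    # Hypothetical encoding: the spatial prior is verbalized as pixel coordinates.
    region = tuple(round(v) for v in expand_box(prev_box, scale, img_w, img_h))
    return (f"Find the object described as '{description}'. "
            f"It was last seen near region {region} in a {img_w}x{img_h} image; "
            f"check that region before searching the rest of the frame.")
```

The same spatial prior could alternatively be injected as a drawn box or a cropped sub-image; the paper's exact encoding is not specified in the text reviewed here.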
If this is right
- Significantly enhances tracking stability under viewpoint changes, occlusions, and rapid target movements.
- Effectively suppresses interference from distracting visual content.
- Retains advantages of global tracking while improving target disambiguation.
- Opens a new avenue for integrating MLLMs into visual tracking applications.
Where Pith is reading between the lines
- The same prompting idea could be adapted to improve global search in other vision tasks such as detection or segmentation.
- If the prompting adds little overhead, it may allow MLLM-based trackers to run on longer sequences without accumulating drift.
- Future work might explore how to automatically adjust the prompt strength based on scene complexity or target motion speed.
Load-bearing premise
The location-aware visual prompting mechanism can reliably direct the model to the correct region and suppress similar distractors without creating new failure modes or excessive computational overhead.
What would settle it
Run the tracker on video sequences containing sudden large viewpoint shifts and multiple objects that match the language description; if success rates stay high without extra failures from prompt errors, the claim holds.
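The proposed test can be scored with the standard overlap-based success metric used in tracking benchmarks. A minimal sketch, assuming axis-aligned (x1, y1, x2, y2) boxes and the conventional IoU >= 0.5 success threshold:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames where the predicted box overlaps ground truth above threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

Comparing this rate on distractor-heavy sequences with and without the location-aware prompt would isolate the claimed benefit.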
Original abstract
Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to present VPTracker as the first global vision-language tracking framework based on Multimodal Large Language Models (MLLMs). It proposes a location-aware visual prompting mechanism that constructs a region-level prompt from the target's previous location to prioritize local recognition within the MLLM while resorting to global inference only when necessary, thereby improving robustness to viewpoint changes, occlusions, rapid motion, and distractors from similar objects.
Significance. If the central claims are substantiated, the work would be significant for integrating MLLMs into visual tracking by enabling reliable global search without excessive distraction, potentially advancing robustness beyond local-search baselines. The open-sourced code at the provided GitHub link supports reproducibility and allows direct verification of the prompting implementation.
major comments (2)
- [Abstract and Section 3] Abstract and Section 3 (location-aware visual prompting): the mechanism is described as resorting to global inference 'only when necessary,' but no explicit criterion, threshold, failure detector, or decision rule is given for triggering the fallback. This is load-bearing for the robustness claim: accumulated drift can invalidate the spatial prior, risking untested error propagation even as the paper asserts the advantages of global search.
- [Section 4] Section 4 (Experiments): the abstract asserts that 'extensive experiments show significantly enhanced tracking stability and target disambiguation,' yet the manuscript provides no detailed ablation isolating the prompting components, no failure-case analysis of the fallback, and no quantitative comparison of drift rates with/without the location-aware prior, leaving the support for the central claim difficult to verify.
minor comments (1)
- [Section 3] Notation for the visual prompt construction could be formalized with an equation or pseudocode to clarify how the spatial prior is encoded into the MLLM input.
Simulated Author's Rebuttal
Thank you for the thorough review and valuable suggestions. We will revise the manuscript to address the concerns raised regarding the description of the fallback mechanism and the experimental validations. Below we provide point-by-point responses to the major comments.
Point-by-point responses
Referee: [Abstract and Section 3] Abstract and Section 3 (location-aware visual prompting): the mechanism is described as resorting to global inference 'only when necessary,' but no explicit criterion, threshold, failure detector, or decision rule is given for triggering the fallback. This is load-bearing for the robustness claim: accumulated drift can invalidate the spatial prior, risking untested error propagation even as the paper asserts the advantages of global search.
Authors: We agree with the referee that an explicit description of the triggering criterion for global inference is necessary to fully substantiate the robustness claims. The current manuscript describes the location-aware prompting at a high level but does not detail the exact decision rule. In the revised version, we will expand Section 3 to include the precise criterion used to determine when to resort to global search, a failure detector mechanism, and an analysis of potential error propagation. We will also add pseudocode for the tracking procedure. revision: yes
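One plausible shape for the promised decision rule, sketched here purely as an illustration: track in the prior region first and fall back to whole-image inference when region-level confidence drops. The confidence-threshold trigger, the 0.5 default, and all function names are assumptions, not the authors' criterion:

```python
def track_frame(frame, prev_box, region_search, global_search, conf_threshold=0.5):
    """Return (box, confidence, used_global) for one frame.

    region_search(frame, prev_box) -> (box, conf): MLLM inference restricted
        to the region-level prompt built from the previous location.
    global_search(frame) -> (box, conf): MLLM inference over the full image.
    """
    box, conf = region_search(frame, prev_box)
    if conf >= conf_threshold:
        return box, conf, False          # spatial prior was sufficient
    box, conf = global_search(frame)     # fallback: global inference
    return box, conf, True
```

A fixed threshold like this is exactly the kind of detail the referee asks to see stated and ablated, since a poorly chosen trigger could either reintroduce drift or negate the distractor suppression.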
Referee: [Section 4] Section 4 (Experiments): the abstract asserts that 'extensive experiments show significantly enhanced tracking stability and target disambiguation,' yet the manuscript provides no detailed ablation isolating the prompting components, no failure-case analysis of the fallback, and no quantitative comparison of drift rates with/without the location-aware prior, leaving the support for the central claim difficult to verify.
Authors: We acknowledge that the experimental section lacks the specific ablations and analyses mentioned. While we present overall performance improvements and some component studies, we did not include isolated ablations for the location-aware prompting, failure-case breakdowns for the fallback, or direct drift rate comparisons. In the major revision, we will add these elements to Section 4: detailed ablations on prompting components, qualitative and quantitative failure-case analysis, and metrics showing drift reduction with the location-aware prior. This will provide clearer evidence for the claims of enhanced stability and disambiguation. revision: yes
Circularity Check
No circularity: new framework with experimental validation
full rationale
The paper introduces VPTracker as a novel global tracking framework built on MLLMs with a location-aware visual prompting mechanism. Its claims rest on the design of incorporating spatial priors from the previous location to prioritize region-level recognition, with the fallback to global inference presented as a new contribution. In the provided text, no equations, no fitted parameters relabeled as predictions, and no self-citations appear load-bearing for the uniqueness claims or serve as ansatzes. The experimental results are cited as independent support rather than reducing, by construction, to the input definitions. The derivation chain is self-contained as a proposed architecture, with no self-referential reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multimodal large language models can perform semantic reasoning to locate described targets across an entire image.
invented entities (1)
- location-aware visual prompt (no independent evidence)
Forward citations
Cited by 2 Pith papers
- Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs. PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
- Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs. PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.