Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking

Dan Xu; Jin Tang; Lan Chen; Liye Jin; Xiao Wang; Yaowei Wang; Yonghong Tian; Yuehang Li

arXiv: 2606.29357 · v1 · pith:RWYDAMARnew · submitted 2026-06-28 · 💻 cs.CV · cs.AI · cs.LG

Xiao Wang, Liye Jin, Dan Xu, Yuehang Li, Lan Chen, Yaowei Wang, Yonghong Tian, Jin Tang

Pith reviewed 2026-06-30 07:18 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords vision-language tracking · natural language specification · dynamic parsing · adaptive textual update · Qwen-VL · tracking benchmarks

The pith

A language dependency parsing mechanism with Qwen-VL enables dynamic, component-aware updates to natural language specifications for vision-language tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing vision-language trackers suffer from semantic-visual mismatches when target appearance or position changes, because they generate or update textual descriptions in error-prone ways. It introduces a parsing step that breaks natural language specifications into target objects, semantic concepts, and background context, then uses Qwen-VL to adaptively update only the relevant components. The authors report that this reduces erroneous updates and hallucinations while delivering higher accuracy than baselines. A sympathetic reader would care because reliable natural-language guidance could make trackers more usable in real scenes where objects deform or lighting shifts.

Core claim

A novel language dependency parsing mechanism distills core tracking principal components encompassing target objects, semantic concepts, and background contextual information; component-aware adaptive textual description updates are then performed by exploiting the cross-modal understanding capability of the pre-trained vision-language model Qwen-VL. Integrating these modules into the baseline yields consistent and superior tracking performance on TNL2K, LaSOT, TNLLT, and OTB-LANG.

What carries the argument

Language dependency parsing mechanism that separates target objects, semantic concepts, and background context, followed by component-aware adaptive updates driven by Qwen-VL.
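As a rough illustration of what "component-aware" could mean in practice, the sketch below keeps the query as a three-field record and rewrites only the fields marked as editable. It is a minimal sketch under stated assumptions, not the authors' implementation: the `QueryTriplet` class, the `component_aware_update` function, the `fake_vlm` callback (a stand-in for a Qwen-VL call), and the example phrases are all hypothetical. Restricting edits to the concept field by default is an assumption loosely based on the Figure 2 caption, which mentions updating concept spans.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class QueryTriplet:
    """Structured language state with the three components named in the paper."""
    target: str      # the object being tracked
    concept: str     # semantic attributes or state of the target
    background: str  # surrounding context


# Hypothetical stand-in for a VLM call: (field name, current text) -> proposed text.
Proposer = Callable[[str, str], str]


def component_aware_update(
    triplet: QueryTriplet,
    propose: Proposer,
    editable: Tuple[str, ...] = ("concept",),
) -> QueryTriplet:
    """Rewrite only the fields listed in `editable`; all other fields stay fixed."""
    changes = {}
    for name in editable:
        current = getattr(triplet, name)
        proposal = propose(name, current).strip()
        # Ignore empty proposals and proposals that change nothing.
        if proposal and proposal != current:
            changes[name] = proposal
    return replace(triplet, **changes)


def fake_vlm(field_name: str, current: str) -> str:
    """Illustrative proposer; a real system would query a vision-language model."""
    return "lying on the sofa" if field_name == "concept" else "a cat"


state = QueryTriplet(target="the white dog", concept="running", background="on the grass")
updated = component_aware_update(state, fake_vlm)
print(updated)
```

Because the target field is never passed to the proposer unless it is explicitly listed in `editable`, a wrong proposal for the target (here, "a cat") cannot overwrite it. That is the kind of guard against erroneous target updating the paper's framing suggests, although the actual mechanism may differ.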

If this is right

Semantic-visual mismatch from dynamic target variations is reduced.
Erroneous target updating, background distraction, and hallucination artifacts are avoided.
Tracking performance improves consistently across TNL2K, LaSOT, TNLLT, and OTB-LANG.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The parsing-plus-update loop could be tested on trackers that already use other vision-language models besides Qwen-VL.
If the component separation proves stable, the same structure might help in tasks such as video object segmentation guided by language.
Releasing the code allows direct measurement of how often updates actually change versus stay fixed across long sequences.
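The last point, measuring how often updates actually change the text, could be checked with a few lines once per-step descriptions are logged. The sketch below is only an assumption about how such a log might look (a list of strings, one per update step); `update_change_rate` is a hypothetical helper, not part of the released code.

```python
def update_change_rate(descriptions):
    """Fraction of consecutive update steps whose text differs from the previous step."""
    if len(descriptions) < 2:
        return 0.0
    # Normalize case and whitespace so cosmetic differences do not count as changes.
    normalized = [" ".join(d.lower().split()) for d in descriptions]
    changed = sum(1 for prev, cur in zip(normalized, normalized[1:]) if prev != cur)
    return changed / (len(normalized) - 1)


# Illustrative log of descriptions across five update steps.
log = [
    "white dog running on the grass",
    "white dog running on the grass",
    "White dog  running on the grass",  # same text up to case and spacing
    "white dog lying on the grass",
    "white dog lying on the sofa",
]
rate = update_change_rate(log)
print(f"{rate:.2f}")  # 0.50
```

A rate near zero over long sequences would suggest the update module rarely acts, while a rate near one would suggest it rewrites the text at almost every step. Either extreme would be worth inspecting.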

Load-bearing premise

Qwen-VL can accurately extract and update only the relevant tracking components without adding wrong information or hallucinations.

What would settle it

Running the method on TNL2K or LaSOT sequences with large appearance changes and measuring whether tracking accuracy drops or hallucinated descriptions appear in the updated text.
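One simple way to quantify "whether tracking accuracy drops" is to compare box overlap between a run with static text and a run with updated text. The sketch below computes intersection over union (IoU) and a fixed-threshold success rate. It is a simplification: benchmark toolkits typically report the area under the success curve over many thresholds. The box format `(x, y, w, h)` and the example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_rate(predictions, ground_truth, threshold=0.5):
    """Share of frames whose IoU with the ground truth exceeds `threshold`."""
    overlaps = [iou(p, g) for p, g in zip(predictions, ground_truth)]
    if not overlaps:
        return 0.0
    return sum(o > threshold for o in overlaps) / len(overlaps)


# Three illustrative frames: perfect overlap, half-shifted box, complete miss.
gt = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
print(success_rate(pred, gt))
```

Running this per sequence for both variants, restricted to sequences annotated with large appearance changes, would give a direct before/after comparison alongside a manual read of the updated text for hallucinated content.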

Figures

Figures reproduced from arXiv: 2606.29357 by Dan Xu, Jin Tang, Lan Chen, Liye Jin, Xiao Wang, Yaowei Wang, Yonghong Tian, Yuehang Li.

**Figure 1.** Comparison between existing language-updating VLT paradigms and our proposed fine-grained text update framework. (a) Existing VLT methods typically perform holistic language updating, where the updated description may drift toward salient objects, background context, or hallucinated cues. (b) Our framework introduces a fine-grained text update paradigm by decomposing the query into target, concept, and bac…

**Figure 2.** An overview of the proposed fine-grained text update guided tracking framework. Given template and search frames with a language query, we first use a dependency-parsing (DP) module to decompose the query into a structured triplet (target, concept, background). The triplet is refined by Qwen to improve semantic quality. Then, target-conditioned Top-K visual evidence is used to update concept spans via span…

**Figure 3.** Architecture of the dependency parsing module. The module encodes the raw query with word and POS embeddings, predicts dependency arcs and relation labels through biaffine classifiers, and converts the resulting dependency tree into a structured triplet consisting of target, concept, and background fields. The concept field t_con is then constructed from the remaining nominal tokens that do not belong to …

**Figure 4.** Visualization of dependency-based triplet parsing. For each example, the raw query is parsed into token-level dependency relations and converted into a structured triplet consisting of target, concept, and background fields. The upper triplet in each row shows the initial parsing result, while the lower triplet shows the refined language state used by our tracker. The visualization illustrates that the pro…

**Figure 5.** Tracking results on TNLLT. PR, NPR, and SR curves are reported to compare different trackers under varying evaluation thresholds. Our method achieves consistently strong performance, showing the benefit of fine-grained text updating for robust vision-language tracking.

Table I (overall tracking performance on the TNLLT dataset; truncated in extraction). Columns: Trackers, Source, Type, PR, NPR, SR. First row: OSTrack [47], ECCV 2022, BB, 57.3, 63.6, 52.1. Second row begins: MixFormer [4…

**Figure 6.** Qualitative comparison on TNLLT. We compare our tracker with three representative trackers, including AiATrack, OSTrack, and JointNLT, on three challenging sequences. Each row corresponds to one video sequence and contains sampled keyframes, the raw language query, the structured triplet used by our method, and tracking results from different trackers. In the structured triplet, the target field captures s…

**Figure 7.** Attribute-based robustness analysis on TNLLT. The radar chart compares different trackers across 15 challenging attributes. Larger values indicate better robustness under the corresponding attribute.

Accompanying text (fragment): … attributes, achieving 62.85 on MB and 68.99 on FM. Although our method is slightly behind the best-performing tracker on VC, BC, and LR, the performance gaps are marginal. These results indicate that the propo…
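The Figure 3 caption says the parsing module predicts dependency arcs through biaffine classifiers, and the reference list includes the deep biaffine parser of Dozat and Manning [40]. For readers unfamiliar with that scoring form, the sketch below shows the general shape of a biaffine arc score, head_j^T U dep_i + head_j · u, in plain Python. It is a toy illustration under stated assumptions, not the paper's module: the vectors, the matrix `U`, the bias `u`, and the greedy head selection are hand-picked for the example, whereas a trained parser would produce the vectors from learned encoders and usually enforce a valid tree.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in m]


def biaffine_arc_scores(head_vecs, dep_vecs, U, u):
    """scores[i][j] = head_j^T U dep_i + head_j . u (token i's score for head j)."""
    scores = []
    for dep in dep_vecs:
        projected = matvec(U, dep)  # U dep_i
        scores.append([dot(h, projected) + dot(h, u) for h in head_vecs])
    return scores


def greedy_heads(scores):
    """Pick the highest-scoring head per token (no tree constraint enforced)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Positions: 0 = ROOT placeholder, 1 = "white", 2 = "dog" (hand-set toy vectors).
head_vecs = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
dep_vecs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]  # identity, for readability
u = [0.1, 0.0]                # head-only bias term

scores = biaffine_arc_scores(head_vecs, dep_vecs, U, u)
heads = greedy_heads(scores)
print(heads[1:])  # [2, 0]: "white" attaches to "dog", "dog" attaches to ROOT
```

The row for position 0 is computed but ignored, since the ROOT placeholder has no head of its own.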

Original abstract

Vision-language tracking guided by natural language specifications leverages high-level semantic cues of target objects to substantially boost tracking accuracy and robustness. Existing studies have verified that adaptively optimizing textual descriptions throughout the tracking process can effectively mitigate the semantic-visual mismatch induced by dynamic variations in target appearance, position, and other inherent attributes. Nevertheless, mainstream methods that directly generate textual information via sequence models or large language models inevitably suffer from inherent defects, including erroneous target updating, excessive background distraction, and pervasive hallucination artifacts. To address the aforementioned limitations, this paper proposes a novel language dependency parsing mechanism to precisely distill core tracking principal components, encompassing target objects, semantic concepts, and background contextual information. On this basis, we perform component-aware adaptive textual description updates by exploiting the powerful cross-modal understanding capability of the pre-trained vision-language model Qwen-VL. By integrating the proposed elaborately designed modules into the baseline framework, our method achieves consistent and superior tracking performance on multiple large-scale vision-language tracking benchmarks, including TNL2K, LaSOT, TNLLT, and OTB-LANG. The source code and pre-trained models will be released at https://github.com/Event-AHU/Open_VLTrack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's new angle is a dependency parsing step plus Qwen-VL for component-aware text updates in vision-language tracking, but the performance edge rests on the VLM handling distillation and updates without new errors.

The main thing here is a language dependency parsing mechanism that splits natural language specs into target objects, semantic concepts, and background context, then feeds those into Qwen-VL for adaptive, component-aware updates during tracking. The claim is that this beats direct generation from sequence models or LLMs by cutting erroneous updates, background noise, and hallucinations, with reported gains on TNL2K, LaSOT, TNLLT, and OTB-LANG.

What stands out as new is the explicit parsing step before the VLM update; prior work mentioned in the abstract does not appear to structure the input this way. The paper does a clean job naming the concrete failure modes in existing adaptive-text methods and tying the fix to a pre-trained VLM's cross-modal strengths.

The soft spot is the load-bearing assumption that Qwen-VL will reliably extract the right components and produce clean updates without introducing its own artifacts. The abstract gives no prompting details, no failure-mode tests, and no error analysis, so the stress-test concern lands: if the VLM does not behave as hoped, the reported gains cannot be credited to the parsing and update modules. Without seeing the full experiments or ablations, it is difficult to judge whether the results are robust or test-set specific.

This is incremental work aimed at people already building vision-language trackers who need better handling of appearance and attribute changes. A reader in that niche could pick up the module design and try it. It deserves a serious referee because the problem is real, the proposal is concrete, and the benchmarks are standard, even if the VLM reliability piece needs scrutiny.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces a language dependency parsing mechanism to distill core tracking principal components (target objects, semantic concepts, and background contextual information) from natural language specifications and employs the pre-trained vision-language model Qwen-VL for component-aware adaptive textual description updates. By integrating these into a baseline framework, the method is claimed to achieve superior tracking performance on the TNL2K, LaSOT, TNLLT, and OTB-LANG benchmarks while avoiding issues like erroneous target updating and hallucination artifacts.

Significance. Should the empirical results hold and the VLM-based updates prove reliable, this work could advance vision-language tracking by providing a more structured and less error-prone way to dynamically update language specifications, potentially influencing future designs in multi-modal tracking systems. The commitment to releasing code and models supports reproducibility.

major comments (1)

[Proposed method] The assumption that Qwen-VL has the powerful cross-modal understanding capability to precisely distill core tracking principal components and perform component-aware adaptive textual description updates without introducing erroneous target updating or hallucination artifacts is load-bearing for the central claim of superior performance (as stated in the abstract); however, the manuscript provides no mechanism details, prompting strategy, or robustness checks against typical VLM failure modes to substantiate this.

minor comments (1)

[Abstract] The abstract could benefit from a brief mention of the baseline framework used for integration to provide context for the performance claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment point-by-point below.

Referee: [Proposed method] The assumption that Qwen-VL has the powerful cross-modal understanding capability to precisely distill core tracking principal components and perform component-aware adaptive textual description updates without introducing erroneous target updating or hallucination artifacts is load-bearing for the central claim of superior performance (as stated in the abstract); however, the manuscript provides no mechanism details, prompting strategy, or robustness checks against typical VLM failure modes to substantiate this.

Authors: We agree that the current version lacks explicit mechanism details, prompting strategy, and robustness analysis for the Qwen-VL component. In the revision we will add a dedicated subsection describing the exact prompting templates used for component distillation and adaptive updates, together with ablation studies and qualitative failure-case analysis that directly test against hallucination and erroneous target updating. Revision: yes.

Circularity Check

0 steps flagged

No circularity; empirical benchmark claims are independent of inputs

The paper describes a proposed method (language dependency parsing + Qwen-VL component-aware updates) and asserts superior tracking performance on TNL2K, LaSOT, TNLLT, and OTB-LANG after integration into a baseline. No equations, fitted parameters, or derivation chain appear in the provided text. The performance claim is presented as an empirical outcome rather than a quantity forced by construction from the method's own definitions or prior self-citations. No load-bearing step reduces to self-definition, renaming, or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the proposed parsing and updating modules, which depend on the capabilities of existing VLMs and the assumption that parsing can precisely extract components. No free parameters or invented entities are identifiable from the abstract.

axioms (1)

Domain assumption: Pre-trained VLMs like Qwen-VL have powerful cross-modal understanding capabilities for distilling tracking components.
Invoked in the abstract when describing the update process.

Reference graph

Works this paper leans on

74 extracted references · 12 canonical work pages · 6 internal anchors

[1]

Tracking by natural language specification,

Z. Li, R. Tao, E. Gavves, C. G. Snoek, and A. W. Smeulders, “Tracking by natural language specification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6495–6503

2017
[2]

Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,

X. Wang, X. Shu, Z. Zhang, B. Jiang, Y. Wang, Y. Tian, and F. Wu, “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 13 763–13 773

2021
[3]

Describe and Attend to Track: Learning Natural Language guided Structural Representation and Visual Attention for Object Tracking

X. Wang, C. Li, R. Yang, T. Zhang, J. Tang, and B. Luo, “Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking,” arXiv preprint arXiv:1811.10014, 2018

2018
[4]

Grounding-tracking-integration,

Z. Yang, T. Kumar, T. Chen, J. Su, and J. Luo, “Grounding-tracking-integration,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3433–3443, 2020

2020
[5]

Joint visual grounding and tracking with natural language specification,

L. Zhou, Z. Zhou, K. Mao, and Z. He, “Joint visual grounding and tracking with natural language specification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 23 151–23 160

2023
[6]

Large-scale multi-modal pre-trained models: A comprehensive survey,

X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian, and W. Gao, “Large-scale multi-modal pre-trained models: A comprehensive survey,” Machine Intelligence Research, vol. 20, no. 4, pp. 447–482, 2023

2023
[7]

Llmformer: Large language model for open-vocabulary semantic segmentation,

H. Shi, S. D. Dao, and J. Cai, “Llmformer: Large language model for open-vocabulary semantic segmentation,” International Journal of Computer Vision, vol. 133, no. 2, pp. 742–759, 2025

2025
[8]

Pedestrian attribute recognition: A new benchmark dataset and a large language model augmented framework,

J. Jin, X. Wang, Q. Zhu, H. Wang, and C. Li, “Pedestrian attribute recognition: A new benchmark dataset and a large language model augmented framework,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 4, 2025, pp. 4138–4146

2025
[9]

Language decoupling with fine-grained knowledge guidance for referring multi-object tracking,

G. Li, S. Zhuang, Y. Jian, Y. Yan, and H. Wang, “Language decoupling with fine-grained knowledge guidance for referring multi-object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 23 626–23 635

2025
[10]

Chattracker: Enhancing visual tracking performance via chatting with multimodal large language model,

Y. Sun, F. Yu, S. Chen, Y. Zhang, J. Huang, Y. Li, C. Li, and C. Wang, “Chattracker: Enhancing visual tracking performance via chatting with multimodal large language model,” Advances in Neural Information Processing Systems, vol. 37, pp. 39 303–39 324, 2024

2024
[11]

Dynamic updates for language adaptation in visual-language tracking,

X. Li, B. Zhong, Q. Liang, Z. Mo, J. Nong, and S. Song, “Dynamic updates for language adaptation in visual-language tracking,” arXiv preprint arXiv:2503.06621, 2025

2025
[12]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning. PMLR, 2022, pp. 12 888–12 900

2022
[13]

Glad: Generative language-assisted visual tracking for low-semantic templates

X. Luo, Y. Cai, J. Liu, J. Tang, G. Wu, and L. Wang, “Glad: Generative language-assisted visual tracking for low-semantic templates,” International Journal of Computer Vision, vol. 134, no. 3, p. 121, 2026

2026
[14]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

2024
[15]

Reasoningtrack: Chain-of-thought reasoning for long-term vision-language tracking,

X. Wang, L. Jin, X. Lou, S. Wang, L. Chen, B. Jiang, and Z. Zhang, “Reasoningtrack: Chain-of-thought reasoning for long-term vision-language tracking,” arXiv preprint arXiv:2508.05221, 2025

2025
[16]

Lasot: A high-quality benchmark for large-scale single object tracking,

H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling, “Lasot: A high-quality benchmark for large-scale single object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5374–5383

2019
[17]

Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,

X. Wang, X. Shu, Z. Zhang, B. Jiang, Y. Wang, Y. Tian, and F. Wu, “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13 763–13 773

2021
[18]

Divert more attention to vision-language tracking,

M. Guo, Z. Zhang, H. Fan, and L. Jing, “Divert more attention to vision-language tracking,” Advances in Neural Information Processing Systems, vol. 35, pp. 4446–4460, 2022

2022
[19]

All in one: Exploring unified vision-language tracking with multi-modal alignment,

C. Zhang, X. Sun, Y. Yang, L. Liu, Q. Liu, X. Zhou, and Y. Wang, “All in one: Exploring unified vision-language tracking with multi-modal alignment,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5552–5561

2023
[20]

Toward unified token learning for vision-language tracking,

Y. Zheng, B. Zhong, Q. Liang, G. Li, R. Ji, and X. Li, “Toward unified token learning for vision-language tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 4, pp. 2125–2135, 2023

2023
[21]

Context-aware integration of language and visual references for natural language tracking,

Y. Shao, S. He, Q. Ye, Y. Feng, W. Luo, and J. Chen, “Context-aware integration of language and visual references for natural language tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 208–19 217

2024
[22]

Unifying visual and vision-language tracking via contrastive learning,

Y. Ma, Y. Tang, W. Yang, T. Zhang, J. Zhang, and M. Kang, “Unifying visual and vision-language tracking via contrastive learning,” 2024

2024
[23]

Mambavlt: Time-evolving multimodal state space model for vision-language tracking,

X. Liu, L. Zhou, Z. Zhou, J. Chen, and Z. He, “Mambavlt: Time-evolving multimodal state space model for vision-language tracking,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 8731–8741

2025
[24]

Atctrack: Aligning target-context cues with dynamic target states for robust vision-language tracking,

X. Feng, S. Hu, X. Li, D. Zhang, M. Wu, J. Zhang, X. Chen, and K. Huang, “Atctrack: Aligning target-context cues with dynamic target states for robust vision-language tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 850–19 861

2025
[25]

Cross-modal retrieval via deep and bidirectional representation learning,

Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan, “Cross-modal retrieval via deep and bidirectional representation learning,” IEEE Transactions on Multimedia, vol. 18, no. 7, pp. 1363–1377, 2016

2016
[26]

Adversarial attribute-text embedding for person search with natural language query,

Z.-J. Zha, J. Liu, D. Chen, and F. Wu, “Adversarial attribute-text embedding for person search with natural language query,” IEEE Transactions on Multimedia, vol. 22, no. 7, pp. 1836–1846, 2020

2020
[27]

Align and tell: Boosting text-video retrieval with local alignment and fine-grained supervision,

X. Wang, L. Zhu, Z. Zheng, M. Xu, and Y. Yang, “Align and tell: Boosting text-video retrieval with local alignment and fine-grained supervision,” IEEE Transactions on Multimedia, 2022

2022
[28]

Multi-task cnn model for attribute prediction,

A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia, “Multi-task cnn model for attribute prediction,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1949–1959, 2015

2015
[29]

Correlation graph convolutional network for pedestrian attribute recognition,

H. Fan, H.-M. Hu, S. Liu, W. Lu, and S. Pu, “Correlation graph convolutional network for pedestrian attribute recognition,” IEEE Transactions on Multimedia, vol. 24, pp. 49–60, 2022

2022
[30]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

2021
[31]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

2023
[32]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

2023
[33]

Sam 2: Segment anything in images and videos,

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “Sam 2: Segment anything in images and videos,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 28 085–28 128

2025
[34]

Florence-2: Advancing a unified representation for a variety of vision tasks,

B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan, “Florence-2: Advancing a unified representation for a variety of vision tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829

2024
[35]

LLaVA-OneVision: Easy Visual Task Transfer

B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024

2024
[36]

Qwen2.5-VL Technical Report

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-vl technical report, 2025,” URL https://arxiv.org/abs/2502.13923, vol. 6, pp. 13–23, 2025

2025
[37]

R1-track: Direct application of mllms to visual object tracking via reinforcement learning,

B. Wang and W. Li, “R1-track: Direct application of mllms to visual object tracking via reinforcement learning,” arXiv preprint arXiv:2506.21980, 2025

2025
[38]

VPTracker: Global Vision-Language Tracking via Visual Prompt

J. Wang, K. Zhou, Z. Wu, K. Ji, D. Huang, and Y. Zheng, “Vptracker: Global vision-language tracking via visual prompt and mllm,” arXiv preprint arXiv:2512.22799, 2025

2025
[39]

Llm-track: Semantic multi-object tracking with multi-modal large language models,

P. Liao, F. Yang, D. Wu, J. Yu, Y. Zhu, W. Zhao, and D. Zhang, “Llm-track: Semantic multi-object tracking with multi-modal large language models,” arXiv preprint arXiv:2601.06550, 2026

2026
[40]

Deep biaffine attention for neural dependency parsing,

T. Dozat and C. D. Manning, “Deep biaffine attention for neural dependency parsing,” in International Conference on Learning Representations, 2017

2017
[41]

Universal dependencies v1: A multilingual treebank collection,

J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman, “Universal dependencies v1: A multilingual treebank collection,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation, 2016, pp. 1659–1666

2016
[42]

A gold standard dependency corpus for english,

N. Silveira, T. Dozat, M.-C. de Marneffe, S. R. Bowman, M. Connor, J. Bauer, and C. D. Manning, “A gold standard dependency corpus for english,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation, 2014, pp. 2897–2904

2014
[43]

Lora: Low-rank adaptation of large language models,

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022

2022
[44]

Generalized intersection over union: A metric and a loss for bounding box regression,

H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666

2019
[45]

Focal loss for dense object detection,

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988

2017
[46]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019

2019
[47]

Joint feature learning and relation modeling for tracking: A one-stream framework,

B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in ECCV, 2022

2022
[48]

Mixformer: End-to-end tracking with iterative mixed attention,

Y. Cui, C. Jiang, L. Wang, and G. Wu, “Mixformer: End-to-end tracking with iterative mixed attention,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 13 608–13 618

2022
[49]

Aiatrack: Attention in attention for transformer visual tracking,

S. Gao, C. Zhou, C. Ma, X. Wang, and J. Yuan, “Aiatrack: Attention in attention for transformer visual tracking,” in European Conference on Computer Vision. Springer, 2022, pp. 146–164

2022
[50]

Citetracker: Correlating image and text for visual tracking,

X. Li, Y. Huang, Z. He, Y. Wang, H. Lu, and M.-H. Yang, “Citetracker: Correlating image and text for visual tracking,” in ICCV, 2023

2023
[51]

Robust object modeling for visual tracking,

Y. Cai, J. Liu, J. Tang, and G. Wu, “Robust object modeling for visual tracking,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 9589–9600

2023
[52]

Generalized relation modeling for transformer tracking,

S. Gao, C. Zhou, and J. Zhang, “Generalized relation modeling for transformer tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 686–18 695

2023
[53]

Odtrack: Online dense temporal token learning for visual tracking,

Y. Zheng, B. Zhong, Q. Liang, Z. Mo, S. Zhang, and X. Li, “Odtrack: Online dense temporal token learning for visual tracking,” in AAAI, 2024

2024
[54]

Explicit visual prompts for visual object tracking,

L. Shi, B. Zhong, Q. Liang, N. Li, S. Zhang, and X. Li, “Explicit visual prompts for visual object tracking,” in AAAI, 2024

2024
[55]

Autoregressive queries for adaptive tracking with spatio-temporal transformers,

J. Xie, B. Zhong, Z. Mo, S. Zhang, L. Shi, S. Song, and R. Ji, “Autoregressive queries for adaptive tracking with spatio-temporal transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 300–19 309

2024
[56]

Less is more: Token context-aware learning for object tracking,

C. Xu, B. Zhong, Q. Liang, Y. Zheng, G. Li, and S. Song, “Less is more: Token context-aware learning for object tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8824–8832

2025
[57]

Enhancing vision-language tracking by effectively converting textual cues into visual cues,

X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, and K. Huang, “Enhancing vision-language tracking by effectively converting textual cues into visual cues,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025
[58]

Sutrack: Towards simple and unified single object tracking,

X. Chen, B. Kang, W. Geng, J. Zhu, Y. Liu, D. Wang, and H. Lu, “Sutrack: Towards simple and unified single object tracking,” 2025

2025
[59]

Dynamic updates for language adaptation in visual-language tracking,

X. Li, B. Zhong, Q. Liang, Z. Mo, J. Nong, and S. Song, “Dynamic updates for language adaptation in visual-language tracking,” in CVPR, 2025

2025
[60]

Selective distillation of language tokens for redundancy suppression in vision-language tracking,

T. Bai, S. Yang, Y. Wang, and G. Zhang, “Selective distillation of language tokens for redundancy suppression in vision-language tracking,” Expert Systems with Applications, p. 132255, 2026

2026
[61]

Seqtrack: Sequence to sequence learning for visual object tracking,

X. Chen, H. Peng, D. Wang, H. Lu, and H. Hu, “Seqtrack: Sequence to sequence learning for visual object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 14 572–14 581

2023
[62]

Tdcl: Dense semantic contrastive learning for vision-language tracking,

Z. Wang, X. He, K. Lan, Y. Cui, and D. Guo, “Tdcl: Dense semantic contrastive learning for vision-language tracking,” in ECAI 2024: 27th European Conference on Artificial Intelligence, 19–24 October 2024, Santiago de Compostela, Spain–Including 13th Conference on Prestigious Applications of Intelligent Systems (PAIS 2024). SAGE Publications Pvt. Ltd 1 O...

2024
[63]

Consistencies are all you need for semi-supervised vision-language tracking,

J. Ge, J. Cao, X. Zhu, X. Zhang, C. Liu, K. Wang, and B. Liu, “Consistencies are all you need for semi-supervised vision-language tracking,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1895–1904

2024
[64]

One-stream stepwise decreasing for vision-language tracking,

G. Zhang, B. Zhong, Q. Liang, Z. Mo, N. Li, and S. Song, “One-stream stepwise decreasing for vision-language tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, pp. 9053–9063, 2024

2024
[65]

Atstrack: Enhancing visual-language tracking by aligning temporal and spatial scales,

Y. Zhen, Q. Wang, Y. Qiao, L. Qu, and H. Fan, “Atstrack: Enhancing visual-language tracking by aligning temporal and spatial scales,” arXiv preprint arXiv:2507.00454, 2025

2025
[66]

Robust tracking via mamba-based context-aware token learning,

J. Xie, B. Zhong, Q. Liang, N. Li, Z. Mo, and S. Song, “Robust tracking via mamba-based context-aware token learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8727–8735

2025
[67]

Aware distillation for robust vision-language tracking under linguistic sparsity,

G. Zhang, B. Zhong, S. Yang, Y. Wang, and T. Bai, “Aware distillation for robust vision-language tracking under linguistic sparsity,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 15, 2026, pp. 12 439–12 447

2026
[68]

Vision-language tracking with attention-based optimization,

S. Hu, T. Liu, L. Han, and R. Xing, “Vision-language tracking with attention-based optimization,” Journal of Visual Communication and Image Representation, vol. 114, p. 104644, 2026

2026
[69]

Sematrack: semantic-driven unified vision-language tracking,

J. Zhang, L. Xu, H. Zhang, X. Yan, B. Jiang, and J. Wu, “Sematrack: semantic-driven unified vision-language tracking,” The Visual Computer, vol. 42, no. 2, p. 142, 2026

2026
[70]

Rwkv-inspired multi-modal relation modeling for vision-language tracking,

G. Zhang, Y. Wang, B. Zhong, Y. Mu, and T. Bai, “Rwkv-inspired multi-modal relation modeling for vision-language tracking,” IEEE Transactions on Multimedia, 2026

2026
[71]

Transformer tracking with cyclic shifting window attention,

Z. Song, J. Yu, Y.-P. P. Chen, and W. Yang, “Transformer tracking with cyclic shifting window attention,” 2022. [Online]. Available: https://arxiv.org/abs/2205.03806

2022
[72]

Transformer vision-language tracking via proxy token guided cross-modal fusion,

H. Zhao, X. Wang, D. Wang, H. Lu, and X. Ruan, “Transformer vision-language tracking via proxy token guided cross-modal fusion,” Pattern Recognition Letters, vol. 168, pp. 10–16, 2023

2023
[73]

Real-time visual object tracking with natural language description,

Q. Feng, V. Ablavsky, Q. Bai, G. Li, and S. Sclaroff, “Real-time visual object tracking with natural language description,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 700–709

2020
[74]

Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers,

Q. Feng, V. Ablavsky, Q. Bai, and S. Sclaroff, “Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5851–5860

2021
