
arxiv: 2606.29357 · v1 · pith:RWYDAMARnew · submitted 2026-06-28 · 💻 cs.CV · cs.AI · cs.LG

Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking

Pith reviewed 2026-06-30 07:18 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords vision-language tracking · natural language specification · dynamic parsing · adaptive textual update · Qwen-VL · tracking benchmarks

The pith

A language dependency parsing mechanism with Qwen-VL enables dynamic, component-aware updates to natural language specifications for vision-language tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing vision-language trackers suffer from semantic-visual mismatches when target appearance or position changes because they generate or update textual descriptions in error-prone ways. It introduces a parsing step that breaks natural language specs into target objects, semantic concepts, and background context, then uses Qwen-VL to update only the relevant components adaptively. This is shown to reduce erroneous updates and hallucinations while delivering higher accuracy than baselines. A sympathetic reader would care because reliable natural-language guidance could make trackers more usable in real scenes where objects deform or lighting shifts.

Core claim

A novel language dependency parsing mechanism distills core tracking principal components encompassing target objects, semantic concepts, and background contextual information; component-aware adaptive textual description updates are then performed by exploiting the cross-modal understanding capability of the pre-trained vision-language model Qwen-VL. Integrating these modules into the baseline yields consistent and superior tracking performance on TNL2K, LaSOT, TNLLT, and OTB-LANG.

What carries the argument

Language dependency parsing mechanism that separates target objects, semantic concepts, and background context, followed by component-aware adaptive updates driven by Qwen-VL.
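
One way to picture the component-aware update is as a structured record whose fields can be revised independently. The sketch below is illustrative only: the `QueryTriplet` class, the `update_field` helper, and the `propose` callback (which stands in for a vision-language model call) are hypothetical names introduced here, not the authors' implementation.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Hypothetical container for the three components named in the paper:
# target object, semantic concept, and background context.
@dataclass(frozen=True)
class QueryTriplet:
    target: str
    concept: str
    background: str

FIELDS = ("target", "concept", "background")

def update_field(triplet: QueryTriplet, field: str,
                 propose: Callable[[str], str]) -> QueryTriplet:
    """Return a copy of `triplet` with only `field` rewritten.

    `propose` is an assumed stand-in for a vision-language model: it
    receives the current text of the field and returns new text.
    An empty proposal is treated as "no update".
    """
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field}")
    new_text = propose(getattr(triplet, field)).strip()
    if not new_text:
        return triplet
    return replace(triplet, **{field: new_text})

# Toy usage with a fake proposal function in place of a real model.
initial = QueryTriplet(target="white dog", concept="running",
                       background="on the grass")
updated = update_field(initial, "concept", lambda old: "lying down")
print(updated)
```

Because the record is immutable and only one named field is replaced, the target and background text cannot drift as a side effect of a concept update, which is the property the paper's design appears to aim for.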

If this is right

  • Semantic-visual mismatch from dynamic target variations is reduced.
  • Erroneous target updating, background distraction, and hallucination artifacts are avoided.
  • Tracking performance improves consistently across TNL2K, LaSOT, TNLLT, and OTB-LANG.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The parsing-plus-update loop could be tested on trackers that already use other vision-language models besides Qwen-VL.
  • If the component separation proves stable, the same structure might help in tasks such as video object segmentation guided by language.
  • Releasing the code allows direct measurement of how often updates actually change versus stay fixed across long sequences.
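
The last point can be made concrete with a small measurement. The sketch below assumes the tracker logs one language state per update step as a plain string; `change_rate` is a hypothetical helper, not part of any released code.

```python
from typing import Sequence

def change_rate(states: Sequence[str]) -> float:
    """Fraction of consecutive update steps at which the logged text changed."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return changes / (len(states) - 1)

# Toy log: the description changes once across three transitions.
log = [
    "white dog running",
    "white dog running",
    "white dog lying down",
    "white dog lying down",
]
print(change_rate(log))
```

A rate near zero on long sequences would suggest the update module rarely fires, while a rate near one would suggest the text is rewritten at almost every step.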

Load-bearing premise

Qwen-VL can accurately extract and update only the relevant tracking components without adding wrong information or hallucinations.

What would settle it

Running the method on TNL2K or LaSOT sequences with large appearance changes and measuring whether tracking accuracy drops or hallucinated descriptions appear in the updated text.
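
The accuracy side of such a test is usually scored by box overlap. As a minimal sketch, assuming boxes are given as `(x, y, width, height)` tuples and that a frame counts as a success when its intersection-over-union exceeds a threshold, the hypothetical helpers `iou` and `success_rate` below show the computation. Benchmark toolkits typically sweep many thresholds; the single threshold here is a simplification.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), an assumed format

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred: Sequence[Box], gt: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Share of frames whose predicted box overlaps ground truth above `threshold`."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if not pred:
        return 0.0
    hits = sum(1 for p, g in zip(pred, gt) if iou(p, g) > threshold)
    return hits / len(pred)

# Toy sequence: one perfect frame and one complete miss.
pred = [(0, 0, 10, 10), (0, 0, 10, 10)]
gt = [(0, 0, 10, 10), (20, 20, 5, 5)]
print(success_rate(pred, gt))
```

Comparing this score on the same sequences with and without text updates enabled would show whether the updates help or hurt when appearance changes are large.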

Figures

Figures reproduced from arXiv: 2606.29357 by Dan Xu, Jin Tang, Lan Chen, Liye Jin, Xiao Wang, Yaowei Wang, Yonghong Tian, Yuehang Li.

Figure 1
Comparison between existing language-updating VLT paradigms and our proposed fine-grained text update framework. (a) Existing VLT methods typically perform holistic language updating, where the updated description may drift toward salient objects, background context, or hallucinated cues. (b) Our framework introduces a fine-grained text update paradigm by decomposing the query into target, concept, and bac…
Figure 2
An overview of the proposed fine-grained text update guided tracking framework. Given template and search frames with a language query, we first use a dependency-parsing (DP) module to decompose the query into a structured triplet (target, concept, background). The triplet is refined by Qwen to improve semantic quality. Then, target-conditioned Top-K visual evidence is used to update concept spans via span…
Figure 3
Architecture of the dependency parsing module. The module encodes the raw query with word and POS embeddings, predicts dependency arcs and relation labels through biaffine classifiers, and converts the resulting dependency tree into a structured triplet consisting of target, concept, and background fields. The concept field t_con is then constructed from the remaining nominal tokens that do not belong to …
Figure 4
Visualization of dependency-based triplet parsing. For each example, the raw query is parsed into token-level dependency relations and converted into a structured triplet consisting of target, concept, and background fields. The upper triplet in each row shows the initial parsing result, while the lower triplet shows the refined language state used by our tracker. The visualization illustrates that the pro…
Figure 5
Tracking results on TNLLT. PR, NPR, and SR curves are reported to compare different trackers under varying evaluation thresholds. Our method achieves consistently strong performance, showing the benefit of fine-grained text updating for robust vision-language tracking.
Table I (fragment): Overall tracking performance on the TNLLT dataset. Columns: Trackers, Source, Type, PR, NPR, SR. First row: OSTrack [47], ECCV 2022, BB, 57.3, 63.6, 52.1. The next row begins MixFormer [4…
Figure 6
Qualitative comparison on TNLLT. We compare our tracker with three representative trackers, including AiATrack, OSTrack, and JointNLT, on three challenging sequences. Each row corresponds to one video sequence and contains sampled keyframes, the raw language query, the structured triplet used by our method, and tracking results from different trackers. In the structured triplet, the target field captures s…
Figure 7
Attribute-based robustness analysis on TNLLT. The radar chart compares different trackers across 15 challenging attributes. Larger values indicate better robustness under the corresponding attribute.
Accompanying text (fragment): … attributes, achieving 62.85 on MB and 68.99 on FM. Although our method is slightly behind the best-performing tracker on VC, BC, and LR, the performance gaps are marginal. These results indicate that the propo…
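
The Figure 3 caption mentions biaffine classifiers for arc prediction. The following is a minimal sketch of biaffine arc scoring in the style of Dozat and Manning [40], written in plain Python with toy numbers. The function names, vector sizes, and weights are illustrative assumptions, and greedy head selection is used only for brevity (it does not guarantee a well-formed tree); nothing below is taken from the paper's code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]

def biaffine_arc_scores(heads: List[Vector], deps: List[Vector],
                        U: Matrix, u: Vector) -> Matrix:
    """scores[i][j] = heads[j]^T U deps[i] + u^T heads[j].

    Row i holds the score of every candidate head j for dependent token i.
    """
    scores = []
    for dep in deps:
        projected = matvec(U, dep)  # U @ dep
        scores.append([dot(head, projected) + dot(u, head) for head in heads])
    return scores

def greedy_heads(scores: Matrix) -> List[int]:
    """Pick the highest-scoring head index for each dependent token."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in scores]

# Toy example with two tokens and 2-dimensional representations.
heads = [[1.0, 0.0], [0.0, 1.0]]
deps = [[0.0, 1.0], [1.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, chosen for readability
u = [0.0, 0.0]                # head-only bias term
scores = biaffine_arc_scores(heads, deps, U, u)
print(greedy_heads(scores))
```

In a trained parser the head and dependent vectors would come from separate projections of the encoder output, and a second biaffine classifier would score relation labels for each chosen arc.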
Abstract

Vision-language tracking guided by natural language specifications leverages high-level semantic cues of target objects to substantially boost tracking accuracy and robustness. Existing studies have verified that adaptively optimizing textual descriptions throughout the tracking process can effectively mitigate the semantic-visual mismatch induced by dynamic variations in target appearance, position, and other inherent attributes. Nevertheless, mainstream methods that directly generate textual information via sequence models or large language models inevitably suffer from inherent defects, including erroneous target updating, excessive background distraction, and pervasive hallucination artifacts. To address the aforementioned limitations, this paper proposes a novel language dependency parsing mechanism to precisely distill core tracking principal components, encompassing target objects, semantic concepts, and background contextual information. On this basis, we perform component-aware adaptive textual description updates by exploiting the powerful cross-modal understanding capability of the pre-trained vision-language model Qwen-VL. By integrating the proposed elaborately designed modules into the baseline framework, our method achieves consistent and superior tracking performance on multiple large-scale vision-language tracking benchmarks, including TNL2K, LaSOT, TNLLT, and OTB-LANG. The source code and pre-trained models will be released at https://github.com/Event-AHU/Open_VLTrack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces a language dependency parsing mechanism to distill core tracking principal components (target objects, semantic concepts, and background contextual information) from natural language specifications and employs the pre-trained vision-language model Qwen-VL for component-aware adaptive textual description updates. By integrating these into a baseline framework, the method is claimed to achieve superior tracking performance on the TNL2K, LaSOT, TNLLT, and OTB-LANG benchmarks while avoiding issues like erroneous target updating and hallucination artifacts.

Significance. Should the empirical results hold and the VLM-based updates prove reliable, this work could advance vision-language tracking by providing a more structured and less error-prone way to dynamically update language specifications, potentially influencing future designs in multi-modal tracking systems. The commitment to releasing code and models supports reproducibility.

major comments (1)
  1. [Proposed method] The assumption that Qwen-VL has the powerful cross-modal understanding capability to precisely distill core tracking principal components and perform component-aware adaptive textual description updates without introducing erroneous target updating or hallucination artifacts is load-bearing for the central claim of superior performance (as stated in the abstract); however, the manuscript provides no mechanism details, prompting strategy, or robustness checks against typical VLM failure modes to substantiate this.
minor comments (1)
  1. [Abstract] The abstract could benefit from a brief mention of the baseline framework used for integration to provide context for the performance claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment point-by-point below.

  1. Referee: [Proposed method] The assumption that Qwen-VL has the powerful cross-modal understanding capability to precisely distill core tracking principal components and perform component-aware adaptive textual description updates without introducing erroneous target updating or hallucination artifacts is load-bearing for the central claim of superior performance (as stated in the abstract); however, the manuscript provides no mechanism details, prompting strategy, or robustness checks against typical VLM failure modes to substantiate this.

    Authors: We agree that the current version lacks explicit mechanism details, prompting strategy, and robustness analysis for the Qwen-VL component. In the revision we will add a dedicated subsection describing the exact prompting templates used for component distillation and adaptive updates, together with ablation studies and qualitative failure-case analysis that directly test against hallucination and erroneous target updating.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark claims are independent of inputs


The paper describes a proposed method (language dependency parsing + Qwen-VL component-aware updates) and asserts superior tracking performance on TNL2K, LaSOT, TNLLT, and OTB-LANG after integration into a baseline. No equations, fitted parameters, or derivation chain appear in the provided text. The performance claim is presented as an empirical outcome rather than a quantity forced by construction from the method's own definitions or prior self-citations. No load-bearing step reduces to self-definition, renaming, or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of the proposed parsing and updating modules, which depend on the capabilities of existing VLMs and the assumption that parsing can precisely extract components. No free parameters or invented entities are identifiable from the abstract.

axioms (1)
  • domain assumption Pre-trained VLMs like Qwen-VL have powerful cross-modal understanding capabilities for distilling tracking components.
    Invoked in the abstract when describing the update process.

pith-pipeline@v0.9.1-grok · 5764 in / 1341 out tokens · 45773 ms · 2026-06-30T07:18:09.394687+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1]

    Tracking by natural language specification

    Z. Li, R. Tao, E. Gavves, C. G. Snoek, and A. W. Smeulders, “Tracking by natural language specification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6495–6503

  2. [2]

    Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark

    X. Wang, X. Shu, Z. Zhang, B. Jiang, Y. Wang, Y. Tian, and F. Wu, “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 13 763–13 773

  3. [3]

    Describe and Attend to Track: Learning Natural Language guided Structural Representation and Visual Attention for Object Tracking

    X. Wang, C. Li, R. Yang, T. Zhang, J. Tang, and B. Luo, “Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking,” arXiv preprint arXiv:1811.10014, 2018

  4. [4]

    Grounding-tracking-integration

    Z. Yang, T. Kumar, T. Chen, J. Su, and J. Luo, “Grounding-tracking-integration,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3433–3443, 2020

  5. [5]

    Joint visual grounding and tracking with natural language specification

    L. Zhou, Z. Zhou, K. Mao, and Z. He, “Joint visual grounding and tracking with natural language specification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 23 151–23 160

  6. [6]

    Large-scale multi-modal pre-trained models: A comprehensive survey

    X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian, and W. Gao, “Large-scale multi-modal pre-trained models: A comprehensive survey,” Machine Intelligence Research, vol. 20, no. 4, pp. 447–482, 2023

  7. [7]

    Llmformer: Large language model for open-vocabulary semantic segmentation

    H. Shi, S. D. Dao, and J. Cai, “Llmformer: Large language model for open-vocabulary semantic segmentation,” International Journal of Computer Vision, vol. 133, no. 2, pp. 742–759, 2025

  8. [8]

    Pedestrian attribute recognition: A new benchmark dataset and a large language model augmented framework

    J. Jin, X. Wang, Q. Zhu, H. Wang, and C. Li, “Pedestrian attribute recognition: A new benchmark dataset and a large language model augmented framework,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 4, 2025, pp. 4138–4146

  9. [9]

    Language decoupling with fine-grained knowledge guidance for referring multi-object tracking

    G. Li, S. Zhuang, Y. Jian, Y. Yan, and H. Wang, “Language decoupling with fine-grained knowledge guidance for referring multi-object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 23 626–23 635

  10. [10]

    Chattracker: Enhancing visual tracking performance via chatting with multimodal large language model

    Y. Sun, F. Yu, S. Chen, Y. Zhang, J. Huang, Y. Li, C. Li, and C. Wang, “Chattracker: Enhancing visual tracking performance via chatting with multimodal large language model,” Advances in Neural Information Processing Systems, vol. 37, pp. 39 303–39 324, 2024

  11. [11]

    Dynamic updates for language adaptation in visual-language tracking

    X. Li, B. Zhong, Q. Liang, Z. Mo, J. Nong, and S. Song, “Dynamic updates for language adaptation in visual-language tracking,” arXiv preprint arXiv:2503.06621, 2025

  12. [12]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning. PMLR, 2022, pp. 12 888–12 900

  13. [13]

    Glad: Generative language-assisted visual tracking for low-semantic templates

    X. Luo, Y. Cai, J. Liu, J. Tang, G. Wu, and L. Wang, “Glad: Generative language-assisted visual tracking for low-semantic templates,” International Journal of Computer Vision, vol. 134, no. 3, p. 121, 2026

  14. [14]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

  15. [15]

    Reasoningtrack: Chain-of-thought reasoning for long-term vision-language tracking

    X. Wang, L. Jin, X. Lou, S. Wang, L. Chen, B. Jiang, and Z. Zhang, “Reasoningtrack: Chain-of-thought reasoning for long-term vision-language tracking,” arXiv preprint arXiv:2508.05221, 2025

  16. [16]

    Lasot: A high-quality benchmark for large-scale single object tracking

    H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling, “Lasot: A high-quality benchmark for large-scale single object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5374–5383

  17. [17]

    Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark

    X. Wang, X. Shu, Z. Zhang, B. Jiang, Y. Wang, Y. Tian, and F. Wu, “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13 763–13 773

  18. [18]

    Divert more attention to vision-language tracking

    M. Guo, Z. Zhang, H. Fan, and L. Jing, “Divert more attention to vision-language tracking,” Advances in Neural Information Processing Systems, vol. 35, pp. 4446–4460, 2022

  19. [19]

    All in one: Exploring unified vision-language tracking with multi-modal alignment

    C. Zhang, X. Sun, Y. Yang, L. Liu, Q. Liu, X. Zhou, and Y. Wang, “All in one: Exploring unified vision-language tracking with multi-modal alignment,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5552–5561

  20. [20]

    Toward unified token learning for vision-language tracking

    Y. Zheng, B. Zhong, Q. Liang, G. Li, R. Ji, and X. Li, “Toward unified token learning for vision-language tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 4, pp. 2125–2135, 2023

  21. [21]

    Context-aware integration of language and visual references for natural language tracking

    Y. Shao, S. He, Q. Ye, Y. Feng, W. Luo, and J. Chen, “Context-aware integration of language and visual references for natural language tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 208–19 217

  22. [22]

    Unifying visual and vision-language tracking via contrastive learning

    Y. Ma, Y. Tang, W. Yang, T. Zhang, J. Zhang, and M. Kang, “Unifying visual and vision-language tracking via contrastive learning,” 2024

  23. [23]

    Mambavlt: Time-evolving multimodal state space model for vision-language tracking

    X. Liu, L. Zhou, Z. Zhou, J. Chen, and Z. He, “Mambavlt: Time-evolving multimodal state space model for vision-language tracking,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 8731–8741

  24. [24]

    Atctrack: Aligning target-context cues with dynamic target states for robust vision-language tracking

    X. Feng, S. Hu, X. Li, D. Zhang, M. Wu, J. Zhang, X. Chen, and K. Huang, “Atctrack: Aligning target-context cues with dynamic target states for robust vision-language tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 850–19 861

  25. [25]

    Cross-modal retrieval via deep and bidirectional representation learning

    Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan, “Cross-modal retrieval via deep and bidirectional representation learning,” IEEE Transactions on Multimedia, vol. 18, no. 7, pp. 1363–1377, 2016

  26. [26]

    Adversarial attribute-text embedding for person search with natural language query

    Z.-J. Zha, J. Liu, D. Chen, and F. Wu, “Adversarial attribute-text embedding for person search with natural language query,” IEEE Transactions on Multimedia, vol. 22, no. 7, pp. 1836–1846, 2020

  27. [27]

    Align and tell: Boosting text-video retrieval with local alignment and fine-grained supervision

    X. Wang, L. Zhu, Z. Zheng, M. Xu, and Y. Yang, “Align and tell: Boosting text-video retrieval with local alignment and fine-grained supervision,” IEEE Transactions on Multimedia, 2022

  28. [28]

    Multi-task cnn model for attribute prediction

    A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia, “Multi-task cnn model for attribute prediction,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1949–1959, 2015

  29. [29]

    Correlation graph convolutional network for pedestrian attribute recognition

    H. Fan, H.-M. Hu, S. Liu, W. Lu, and S. Pu, “Correlation graph convolutional network for pedestrian attribute recognition,” IEEE Transactions on Multimedia, vol. 24, pp. 49–60, 2022

  30. [30]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  31. [31]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

  32. [32]

    Segment anything

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

  33. [33]

    Sam 2: Segment anything in images and videos

    N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “Sam 2: Segment anything in images and videos,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 28 085–28 128

  34. [34]

    Florence-2: Advancing a unified representation for a variety of vision tasks

    B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan, “Florence-2: Advancing a unified representation for a variety of vision tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829

  35. [35]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024

  36. [36]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-vl technical report, 2025,” URL https://arxiv.org/abs/2502.13923, vol. 6, pp. 13–23, 2025

  37. [37]

    R1-track: Direct application of mllms to visual object tracking via reinforcement learning

    B. Wang and W. Li, “R1-track: Direct application of mllms to visual object tracking via reinforcement learning,” arXiv preprint arXiv:2506.21980, 2025

  38. [38]

    VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM

    J. Wang, K. Zhou, Z. Wu, K. Ji, D. Huang, and Y. Zheng, “Vptracker: Global vision-language tracking via visual prompt and mllm,” arXiv preprint arXiv:2512.22799, 2025

  39. [39]

    Llm-track: Semantic multi-object tracking with multi-modal large language models

    P. Liao, F. Yang, D. Wu, J. Yu, Y. Zhu, W. Zhao, and D. Zhang, “Llm-track: Semantic multi-object tracking with multi-modal large language models,” arXiv preprint arXiv:2601.06550, 2026

  40. [40]

    Deep biaffine attention for neural dependency parsing

    T. Dozat and C. D. Manning, “Deep biaffine attention for neural dependency parsing,” in International Conference on Learning Representations, 2017

  41. [41]

    Universal dependencies v1: A multilingual treebank collection

    J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman, “Universal dependencies v1: A multilingual treebank collection,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation, 2016, pp. 1659–1666

  42. [42]

    A gold standard dependency corpus for english

    N. Silveira, T. Dozat, M.-C. de Marneffe, S. R. Bowman, M. Connor, J. Bauer, and C. D. Manning, “A gold standard dependency corpus for english,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation, 2014, pp. 2897–2904

  43. [43]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022

  44. [44]

    Generalized intersection over union: A metric and a loss for bounding box regression

    H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666

  45. [45]

    Focal loss for dense object detection

    T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988

  46. [46]

    Decoupled weight decay regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019

  47. [47]

    Joint feature learning and relation modeling for tracking: A one-stream framework

    B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in ECCV, 2022

  48. [48]

    Mixformer: End-to-end tracking with iterative mixed attention

    Y. Cui, C. Jiang, L. Wang, and G. Wu, “Mixformer: End-to-end tracking with iterative mixed attention,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 13 608–13 618

  49. [49]

    Aiatrack: Attention in attention for transformer visual tracking

    S. Gao, C. Zhou, C. Ma, X. Wang, and J. Yuan, “Aiatrack: Attention in attention for transformer visual tracking,” in European Conference on Computer Vision. Springer, 2022, pp. 146–164

  50. [50]

    Citetracker: Correlating image and text for visual tracking

    X. Li, Y. Huang, Z. He, Y. Wang, H. Lu, and M.-H. Yang, “Citetracker: Correlating image and text for visual tracking,” in ICCV, 2023

  51. [51]

    Robust object modeling for visual tracking

    Y. Cai, J. Liu, J. Tang, and G. Wu, “Robust object modeling for visual tracking,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 9589–9600

  52. [52]

    Generalized relation modeling for transformer tracking

    S. Gao, C. Zhou, and J. Zhang, “Generalized relation modeling for transformer tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 686–18 695

  53. [53]

    Odtrack: Online dense temporal token learning for visual tracking

    Y. Zheng, B. Zhong, Q. Liang, Z. Mo, S. Zhang, and X. Li, “Odtrack: Online dense temporal token learning for visual tracking,” in AAAI, 2024

  54. [54]

    Explicit visual prompts for visual object tracking

    L. Shi, B. Zhong, Q. Liang, N. Li, S. Zhang, and X. Li, “Explicit visual prompts for visual object tracking,” in AAAI, 2024

  55. [55]

    Autoregressive queries for adaptive tracking with spatio-temporal transformers

    J. Xie, B. Zhong, Z. Mo, S. Zhang, L. Shi, S. Song, and R. Ji, “Autoregressive queries for adaptive tracking with spatio-temporal transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 300–19 309

  56. [56]

    Less is more: Token context-aware learning for object tracking

    C. Xu, B. Zhong, Q. Liang, Y. Zheng, G. Li, and S. Song, “Less is more: Token context-aware learning for object tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8824–8832

  57. [57]

    Enhancing vision-language tracking by effectively converting textual cues into visual cues

    X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, and K. Huang, “Enhancing vision-language tracking by effectively converting textual cues into visual cues,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  58. [58]

    Sutrack: Towards simple and unified single object tracking

    X. Chen, B. Kang, W. Geng, J. Zhu, Y. Liu, D. Wang, and H. Lu, “Sutrack: Towards simple and unified single object tracking,” 2025

  59. [59]

    Dynamic updates for language adaptation in visual-language tracking

    X. Li, B. Zhong, Q. Liang, Z. Mo, J. Nong, and S. Song, “Dynamic updates for language adaptation in visual-language tracking,” in CVPR, 2025

  60. [60]

    Selective distillation of language tokens for redundancy suppression in vision-language tracking

    T. Bai, S. Yang, Y. Wang, and G. Zhang, “Selective distillation of language tokens for redundancy suppression in vision-language tracking,” Expert Systems with Applications, p. 132255, 2026

  61. [61]

    Seqtrack: Sequence to sequence learning for visual object tracking

    X. Chen, H. Peng, D. Wang, H. Lu, and H. Hu, “Seqtrack: Sequence to sequence learning for visual object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 14 572–14 581

  62. [62]

    Tdcl: Dense semantic contrastive learning for vision-language tracking

    Z. Wang, X. He, K. Lan, Y. Cui, and D. Guo, “Tdcl: Dense semantic contrastive learning for vision-language tracking,” in ECAI 2024: 27th European Conference on Artificial Intelligence, 19–24 October 2024, Santiago de Compostela, Spain–Including 13th Conference on Prestigious Applications of Intelligent Systems (PAIS 2024). SAGE Publications Pvt. Ltd 1 O...

  63. [63]

    Consistencies are all you need for semi-supervised vision-language tracking

    J. Ge, J. Cao, X. Zhu, X. Zhang, C. Liu, K. Wang, and B. Liu, “Consistencies are all you need for semi-supervised vision-language tracking,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1895–1904

  64. [64]

    One-stream stepwise decreasing for vision-language tracking

    G. Zhang, B. Zhong, Q. Liang, Z. Mo, N. Li, and S. Song, “One-stream stepwise decreasing for vision-language tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, pp. 9053–9063, 2024

  65. [65]

    Atstrack: Enhancing visual-language tracking by aligning temporal and spatial scales

    Y. Zhen, Q. Wang, Y. Qiao, L. Qu, and H. Fan, “Atstrack: Enhancing visual-language tracking by aligning temporal and spatial scales,” arXiv preprint arXiv:2507.00454, 2025

  66. [66]

    Robust tracking via mamba-based context-aware token learning

    J. Xie, B. Zhong, Q. Liang, N. Li, Z. Mo, and S. Song, “Robust tracking via mamba-based context-aware token learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8727–8735

  67. [67]

    Aware distillation for robust vision-language tracking under linguistic sparsity

    G. Zhang, B. Zhong, S. Yang, Y. Wang, and T. Bai, “Aware distillation for robust vision-language tracking under linguistic sparsity,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 15, 2026, pp. 12 439–12 447

  68. [68]

    Vision-language tracking with attention-based optimization

    S. Hu, T. Liu, L. Han, and R. Xing, “Vision-language tracking with attention-based optimization,” Journal of Visual Communication and Image Representation, vol. 114, p. 104644, 2026

  69. [69]

    Sematrack: semantic-driven unified vision-language tracking

    J. Zhang, L. Xu, H. Zhang, X. Yan, B. Jiang, and J. Wu, “Sematrack: semantic-driven unified vision-language tracking,” The Visual Computer, vol. 42, no. 2, p. 142, 2026

  70. [70]

    Rwkv-inspired multi-modal relation modeling for vision-language tracking

    G. Zhang, Y. Wang, B. Zhong, Y. Mu, and T. Bai, “Rwkv-inspired multi-modal relation modeling for vision-language tracking,” IEEE Transactions on Multimedia, 2026

  71. [71]

    Transformer tracking with cyclic shifting window attention

    Z. Song, J. Yu, Y.-P. P. Chen, and W. Yang, “Transformer tracking with cyclic shifting window attention,” 2022. [Online]. Available: https://arxiv.org/abs/2205.03806

  72. [72]

    Transformer vision-language tracking via proxy token guided cross-modal fusion

    H. Zhao, X. Wang, D. Wang, H. Lu, and X. Ruan, “Transformer vision-language tracking via proxy token guided cross-modal fusion,” Pattern Recognition Letters, vol. 168, pp. 10–16, 2023

  73. [73]

    Real-time visual object tracking with natural language description

    Q. Feng, V. Ablavsky, Q. Bai, G. Li, and S. Sclaroff, “Real-time visual object tracking with natural language description,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 700–709

  74. [74]

    Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers

    Q. Feng, V. Ablavsky, Q. Bai, and S. Sclaroff, “Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5851–5860