VPTracker: Global Vision-Language Tracking via Visual Prompt

· 2025 · cs.CV · arXiv 2512.22799

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking

cs.CV · 2026-06-21 · conditional · novelty 7.0

The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.

Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking

cs.CV · 2026-06-28 · unverdicted · novelty 6.0

A language dependency parsing mechanism combined with Qwen-VL enables adaptive updates to textual descriptions for improved vision-language tracking performance on benchmarks like TNL2K and LaSOT.

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

cs.CV · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

citing papers explorer

Showing 1 of 1 citing paper after filters.

4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking cs.CV · 2026-06-21 · conditional · none · ref 31 · internal anchor
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.

VPTracker: Global Vision-Language Tracking via Visual Prompt

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer