pith. machine review for the scientific record.

arxiv: 2604.14074 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

Training-Free Semantic Multi-Object Tracking with Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic multi-object tracking · training-free pipeline · vision-language models · video summarization · instance captioning · interaction recognition · model composition · BenSMOT benchmark

The pith

A training-free pipeline composes pretrained vision and language models to track multiple objects and generate semantic descriptions in video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TF-SMOT as a way to perform semantic multi-object tracking without any end-to-end training or fine-tuning. It does this by chaining together existing pretrained models for detection, mask-based tracking, video captioning, and semantic retrieval to turn raw trajectories into summaries, instance captions, and interaction labels. The goal is to remove the need for expensive labeled data that limits current SMOT systems and to let new foundation models be swapped in quickly. On the BenSMOT benchmark the method matches or exceeds prior trained approaches on tracking accuracy and descriptive quality, though interaction labeling stays difficult when labels must match exactly. The central demonstration is that modular composition of off-the-shelf components can produce temporally coherent tracklets and human-readable scene descriptions.

Core claim

TF-SMOT builds semantic multi-object tracking by first using D-FINE to detect objects and SAM2 to generate temporally consistent mask tracklets, then applying contour grounding with InternVideo2.5 to produce video summaries and per-instance captions, and finally using LLM-based gloss retrieval to map extracted interaction predicates onto BenSMOT WordNet synsets. This composition reaches state-of-the-art tracking performance inside the SMOT setting and yields higher-quality summaries and captions than earlier end-to-end trained systems, while interaction recognition accuracy remains constrained by the fine-grained, long-tailed nature of the label space and by semantic overlap under strict exact-match evaluation.
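To make the shape of the composition concrete, here is a minimal sketch in Python of how such a training-free chain could be wired. It is not the authors' code: the dataclasses, function names, and signatures (detect, track, caption, summarize, align) are hypothetical stand-ins for the D-FINE, SAM2, InternVideo2.5, and gloss-retrieval stages described above, and the sketch only illustrates the interface-level claim that each stage is a frozen component behind a plain callable.

```python
"""Illustrative sketch of a training-free SMOT composition; not the authors' code."""
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Tracklet:
    track_id: int
    masks: List[object]                                      # one segmentation mask per frame (placeholder type)
    caption: str = ""                                        # instance-level caption, filled in later
    interactions: List[str] = field(default_factory=list)    # e.g. WordNet synset names


@dataclass
class SMOTResult:
    summary: str
    tracklets: List[Tracklet]


def run_tf_smot(
    frames: Sequence[object],
    detect: Callable[[object], List[object]],                           # stand-in for a D-FINE wrapper
    track: Callable[[Sequence[object], List[object]], List[Tracklet]],  # stand-in for SAM2 mask propagation
    caption: Callable[[Sequence[object], List[Tracklet]], Dict[int, str]],  # contour-grounded VLM captions
    summarize: Callable[[Sequence[object]], str],                       # video-level summary
    align: Callable[[str], List[str]],                                  # predicate -> synset retrieval
) -> SMOTResult:
    """Chain frozen components (detect -> track -> caption -> align) with no training step."""
    first_frame_detections = detect(frames[0])        # first-frame detections prompt the mask tracker
    tracklets = track(frames, first_frame_detections) # temporally consistent tracklets via mask propagation
    per_instance = caption(frames, tracklets)         # per-instance captions from contour-grounded inputs
    for t in tracklets:
        t.caption = per_instance.get(t.track_id, "")
        t.interactions = align(t.caption)             # map extracted predicates to a fixed label vocabulary
    return SMOTResult(summary=summarize(frames), tracklets=tracklets)
```

The design point the sketch tries to capture is that nothing is jointly optimized: swapping any callable for a newer foundation model leaves the rest of the chain untouched, which is exactly the adaptability argument the core claim makes.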

What carries the argument

The training-free composition of detection, promptable mask tracking, contour-grounded video-language generation, and gloss-based LLM semantic retrieval into one pipeline.

If this is right

  • New foundation models for any of the subtasks can be inserted immediately without retraining the whole system.
  • Tracklets remain temporally consistent across frames through mask propagation rather than learned association.
  • Video summaries and instance captions are generated directly from contour-grounded visual features.
  • Interaction labels are produced by mapping visual predicates to a fixed semantic vocabulary via retrieval and disambiguation.
  • Performance on interaction recognition is bounded by the granularity and exact-match requirements of the WordNet label space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same modular pattern could be applied to other video understanding problems such as long-term action recognition or dynamic scene graph construction by exchanging the language-generation component.
  • Real-time deployment on edge devices would require measuring how the latency of the separate pretrained models adds up and whether any step can be approximated without losing tracklet stability; a minimal timing-harness sketch follows this list.
  • If the individual models continue to improve, the performance gap between this training-free method and fully supervised systems is likely to shrink further on descriptive metrics.
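On the latency point, the following is a minimal sketch of a per-stage timing harness. It assumes only that each pipeline stage is exposed as a callable whose output feeds the next stage; the stage list and the sequential composition are assumptions carried over from the sketch above, not measurements or code from the paper.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


def profile_stages(frames: Sequence[object],
                   stages: List[Tuple[str, Callable]]) -> Dict[str, float]:
    """Time each (hypothetical) pipeline stage, chaining each stage's output to the next input."""
    timings: Dict[str, float] = {}
    data: object = frames
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)                            # one frozen component's output feeds the next
        timings[name] = time.perf_counter() - start
    timings["total"] = sum(timings.values())          # sequential execution: end-to-end latency is the sum
    return timings
```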

Load-bearing premise

Pretrained models for detection, segmentation tracking, video captioning, and language-based predicate alignment can be directly chained without any joint optimization or domain adaptation and still produce consistent tracklets and accurate semantic outputs.
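To illustrate what the predicate-alignment half of this premise involves, here is a minimal gloss-retrieval sketch over WordNet using NLTK. It is deliberately simpler than the paper's approach: where the paper uses LLM-based disambiguation over the BenSMOT synset inventory, this sketch ranks candidate synsets by a naive token overlap between the predicate and each gloss, and the function name, head-word heuristic, and example predicate are assumptions for illustration only.

```python
# Minimal sketch of gloss-based predicate-to-synset retrieval (illustrative, not the paper's method).
# Requires: pip install nltk, then a one-time nltk.download("wordnet").
from typing import Optional

from nltk.corpus import wordnet as wn


def align_predicate(predicate: str, pos: str = "v") -> Optional[str]:
    """Map an extracted interaction predicate to a WordNet synset name by gloss overlap."""
    tokens = set(predicate.lower().split())
    head = predicate.lower().split()[0]               # naive choice of head word
    candidates = wn.synsets(head, pos=pos)
    if not candidates:
        return None

    def gloss_overlap(synset) -> int:
        # Count predicate tokens that also appear in the synset gloss (definition).
        return len(tokens & set(synset.definition().lower().split()))

    best = max(candidates, key=gloss_overlap)
    return best.name()                                # e.g. 'shake.v.01'


if __name__ == "__main__":
    print(align_predicate("shake hands with another person"))
```

Even this crude version exposes the failure mode the review flags: many glosses overlap semantically, so strict exact-match evaluation penalizes near-miss synsets as heavily as outright wrong ones.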

What would settle it

On a new video dataset containing unseen object categories or interaction types, the TF-SMOT pipeline produces fragmented tracklets or captions whose semantic content does not align with human annotations at a rate comparable to trained baselines.

Figures

Figures reproduced from arXiv: 2604.14074 by Elisa Ricci, Francesco Tonini, Laurence Bonat, Lorenzo Vaquero.

Figure 1. Comparison of training-based and training-free Semantic Multi-Object Tracking (SMOT) paradigms. Top: Traditional SMOT systems like SMOTer [16] require task-specific end-to-end training on annotated datasets, coupling tracking and semantic prediction to supervised learning. Bottom: our TF-SMOT operates entirely at inference time by composing frozen, pretrained components to produce tracking outputs together…
Figure 2. TF-SMOT overview. The system decomposes SMOT into explicit functional stages with well-defined interfaces. The Person Tracking module (left) utilizes a person detector Φdet and mask-based tracker Φtrk to determine the person trajectories within the video. Subsequently, the Caption Generation module (right) renders the contours of the people to better ground them and uses a video captioner Φvlm to generate…
Figure 3. TF-SMOT semantic tracking qualitative results on BenSMOT (sequence jT 8i7edEt8). As can be seen, TF-SMOT detects a larger number of people in the scene than the ground truth annotations and generates richer, more detailed instance captions. The predicted instance-level interactions are correspondingly broader and more expressive; however, they do not always exactly match the annotated ground truth interact…
Original abstract

Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TF-SMOT, a training-free pipeline for Semantic Multi-Object Tracking (SMOT) that composes off-the-shelf pretrained components: D-FINE for detection, SAM2 for promptable mask-based video tracking to produce temporally consistent tracklets, contour grounding to condition InternVideo2.5 for video summaries and instance captions, and LLM-based gloss retrieval for aligning interaction predicates to BenSMOT WordNet synsets. On the BenSMOT benchmark, the authors report state-of-the-art tracking performance within the SMOT setting along with improved summary and caption quality relative to prior trained systems, while noting that interaction recognition remains challenging under strict exact-match evaluation due to label granularity and semantic overlap.

Significance. If the empirical claims hold, the work is significant for demonstrating that semantic tracking tasks can be addressed via direct composition of foundation models without task-specific training or fine-tuning, which could accelerate adaptation to new models and interaction vocabularies. The training-free design, use of standard benchmarks, and explicit analysis of factors (semantic overlap, label granularity) affecting interaction performance are strengths that support reproducibility and extensibility.

major comments (2)
  1. [§3.2] §3.2 (Tracking Pipeline): The central claim that D-FINE detections can be directly used to prompt SAM2 for temporally consistent tracklets on BenSMOT rests on untested assumptions about ID maintenance under the dataset's motion and occlusion patterns; no ablation isolating SAM2's memory propagation (e.g., with vs. without contour prompts) is provided to confirm this step is load-bearing for the reported SOTA tracking gains.
  2. [§4.3] §4.3 (Interaction Recognition): The reported challenges under exact-match evaluation are acknowledged, but the paper does not quantify the accuracy of the LLM gloss-retrieval step itself (e.g., precision of synset alignment on a held-out subset); this omission makes it impossible to determine whether the performance gap is due to the training-free composition or to downstream label-space issues.
minor comments (2)
  1. [Abstract] Abstract and §4.1: The SOTA tracking claim should explicitly list the primary metrics (MOTA, IDF1, etc.) and the full set of compared SMOT baselines rather than referring only to 'prior art'.
  2. [Figure 2] Figure 2 and §3.3: The contour-grounding visualization would benefit from an additional panel showing failure cases where grounding leads to incorrect VLM conditioning, to illustrate the limits of the training-free approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. The comments highlight opportunities to strengthen the empirical support for our training-free pipeline. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Tracking Pipeline): The central claim that D-FINE detections can be directly used to prompt SAM2 for temporally consistent tracklets on BenSMOT rests on untested assumptions about ID maintenance under the dataset's motion and occlusion patterns; no ablation isolating SAM2's memory propagation (e.g., with vs. without contour prompts) is provided to confirm this step is load-bearing for the reported SOTA tracking gains.

    Authors: We agree that an explicit ablation would more rigorously demonstrate the contribution of SAM2's memory propagation to ID maintenance. While the pipeline design leverages SAM2's promptable tracking for temporal consistency following D-FINE detections, the original submission did not include a direct comparison. In the revised manuscript, we will add an ablation evaluating tracking metrics on BenSMOT with and without contour prompts to SAM2, isolating the effect of its memory mechanism on the reported SOTA performance. revision: yes

  2. Referee: [§4.3] §4.3 (Interaction Recognition): The reported challenges under exact-match evaluation are acknowledged, but the paper does not quantify the accuracy of the LLM gloss-retrieval step itself (e.g., precision of synset alignment on a held-out subset); this omission makes it impossible to determine whether the performance gap is due to the training-free composition or to downstream label-space issues.

    Authors: We appreciate the suggestion to better isolate error sources in interaction recognition. The manuscript already notes the impact of label granularity and semantic overlap in the WordNet space, but we did not separately measure the LLM gloss-retrieval precision. For the revision, we will evaluate and report the accuracy of the synset alignment step on a held-out subset of BenSMOT annotations. This will clarify the relative contributions of the training-free composition versus inherent label-space challenges. revision: yes
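As a concrete illustration of the measurement the simulated authors commit to above, here is a minimal sketch of exact-match precision for the synset-alignment step on a held-out set. The (predicate, gold synset) pair format and the function names are assumptions, not the authors' protocol.

```python
from typing import Callable, Iterable, Optional, Tuple


def synset_alignment_precision(
    held_out_pairs: Iterable[Tuple[str, str]],        # (extracted predicate, gold WordNet synset name)
    align: Callable[[str], Optional[str]],            # the gloss-retrieval step under evaluation
) -> float:
    """Exact-match precision of predicate-to-synset alignment on a held-out set."""
    correct = total = 0
    for predicate, gold_synset in held_out_pairs:
        total += 1
        if align(predicate) == gold_synset:
            correct += 1
    return correct / total if total else 0.0
```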

Circularity Check

0 steps flagged

No circularity; modular composition evaluated on external benchmark

Full rationale

The paper presents TF-SMOT as a training-free pipeline that directly composes off-the-shelf pretrained models (D-FINE for detection, SAM2 for mask-based tracking, InternVideo2.5 for video-language generation, and LLM for gloss retrieval) without any internal parameter fitting, equations, or derivations. Performance claims (SOTA tracking on BenSMOT, improved summaries/captions) rest on empirical evaluation against an external benchmark using standard metrics, with no reduction of outputs to quantities defined by the paper's own inputs. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the described chain; the approach is self-contained as an explicit modular assembly whose validity is tested externally rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that existing pretrained models are sufficiently general to be composed without adaptation; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption Pretrained models D-FINE, SAM2, and InternVideo2.5 produce sufficiently accurate and temporally consistent outputs when used off-the-shelf on the target domain.
    The pipeline depends on direct composition without fine-tuning.
  • domain assumption LLM-based gloss retrieval can reliably align extracted predicates to WordNet synsets despite label granularity.
    Used for the interaction recognition stage.

pith-pipeline@v0.9.0 · 5512 in / 1288 out tokens · 56377 ms · 2026-05-10T14:07:55.769723+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1] M. Abdar, M. Kollati, S. Kuraparthi, F. Pourpanah, D. McDuff, M. Ghavamzadeh, S. Yan, A. Mohamed, A. Khosravi, E. Cambria, et al. A review of deep learning for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  2. [2] J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language mode…
  3. [3] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In J. Goldstein, A. Lavie, C.-Y. Lin, and C. Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, 2005. Association…
  4. [4] K. Bernardin and R. Stiefelhagen. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
  5. [5] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468, 2016.
  6. [6] F. Bordes, R. Y. Pang, A. Ajay, A. C. Li, A. Bardes, S. Petryk, O. Mañas, Z. Lin, A. Mahmoud, B. Jayaraman, M. Ibrahim, M. Hall, Y. Xiong, J. Lebensold, C. Ross, S. Jayakumar, C. Guo, D. Bouchacourt, H. Al-Tahan, K. Padthe, V. Sharma, H. Xu, X. E. Tan, M. Richards, S. Lavoie, P. Astolfi, R. A. Hemmat, J. Chen, K. Tirumala, R. Assouel, M. Moayeri, … arXiv preprint arXiv:2405.17247.
  7. [7] J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani. Observation-centric SORT: rethinking SORT for robust multi-object tracking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 9686–…
  8. [8] A. Choudhuri, G. Chowdhary, and A. G. Schwing. Ow-viscaptor: Abstractors for open-world video instance segmentation and captioning. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIP…
  9. [9] Y. Dai, Z. Hu, S. Zhang, and L. Liu. A survey of detection-based video multi-object tracking. Displays, 75:102317, 2022.
  10. [10] Y. Dong, C. F. Ruan, Y. Cai, R. Lai, Z. Xu, Y. Zhao, and T. Chen. Xgrammar: Flexible and efficient structured generation engine for large language models. Proceedings of Machine Learning and Systems 7, 2024.
  11. [11] C. Gao, Y. Zou, and J. Huang. iCAN: Instance-centric attention network for human-object interaction detection. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, page 41. BMVA Press, 2018.
  12. [12] G. Han, J. Zhao, L. Zhang, and F. Deng. A Survey of Human-Object Interaction Detection With Deep Learning. IEEE Transactions on Emerging Topics in Computational Intelligence, 9(1):3–26, 2025.
  13. [13] Z. Hou, X. Peng, Y. Qiao, and D. Tao. Visual Compositional Learning for Human-Object Interaction Detection. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision – ECCV 2020, volume 12360, pages 584–600, Cham, 2020. Springer International Publishing.
  14. [14] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1):35–45, 1960.
  15. [15] J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedin…
  16. [16] Y. Li, Q. Li, H. Wang, X. Ma, J. Yao, S. Dong, H. Fan, and L. Zhang. Beyond MOT: Semantic Multi-Object Tracking, 2024.
  17. [17] Z. Li, Y. Chai, T. Y. Zhuo, L. Qu, G. Haffari, F. Li, D. Ji, and Q. H. Tran. FACTUAL: A benchmark for faithful and consistent textual scene graph parsing. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 6377–6390, Toronto, Canada, 2023. Association for Computational Linguistics.
  18. [18] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain,
  19. [19] Association for Computational Linguistics.
  20. [20] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 26286–26296. IEEE, 2024.
  21. [21] J. Luiten, A. Ošep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe. HOTA: A Higher Order Metric for Evaluating Multi-object Tracking. International Journal of Computer Vision, 129(2):548–578, 2021.
  22. [22] T. Meinhardt, A. Kirillov, L. Leal-Taixé, and C. Feichtenhofer. TrackFormer: Multi-Object Tracking with Transformers. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8834–8844, 2022.
  23. [23] G. A. Miller. WordNet: A lexical database for English. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, 1992.
  24. [24] P. Nguyen, K. G. Quach, K. Kitani, and K. Luu. Type-to-track: Retrieve any object via prompt-based tracking. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-…
  25. [25] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In P. Isabelle, E. Charniak, and D. Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, 2002. Association for Computational Linguistics.
  26. [26] Y. Peng, H. Li, P. Wu, Y. Zhang, X. Sun, and F. Wu. D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement. arXiv preprint, abs/2410.13842, 2024.
  27. [27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Even…
  28. [28] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. SAM 2: Segment Anything in Images and Videos, 2024.
  29. [29] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 779–788. IEEE Computer Society, 2016.
  30. [30] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Que…
  31. [31] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance Measures and a Data Set for Multi-target, Multi-camera Tracking. In G. Hua and H. Jégou, editors, Computer Vision – ECCV 2016 Workshops, volume 9914, pages 17–35, Cham, 2016. Springer International Publishing.
  32. [32] P. Sun, Y. Jiang, R. Zhang, E. Xie, J. Cao, X. Hu, T. Kong, Z. Yuan, C. Wang, and P. Luo. TransTrack: Multiple-Object Tracking with Transformer. arXiv, 2020.
  33. [33] L. Team. The llama 3 herd of models. CoRR, abs/2407.21783, 2024.
  34. [34] F. Tonini, L. Vaquero, A. Conti, C. Beyan, and E. Ricci. Dynamic scoring with enhanced semantics for training-free human-object interaction detection. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 2801–2810, 2025.
  35. [35] L. Vaquero, M. Mucientes, and V. M. Brea. SiamMT: Real-time arbitrary multi-object tracking. In ICPR, pages 707–714. IEEE, 2020.
  36. [36] L. Vaquero, Y. Xu, X. Alameda-Pineda, V. M. Brea, and M. Mucientes. Lost and found: Overcoming detector failures in online multi-object tracking. In ECCV (73), volume 15131 of Lecture Notes in Computer Science, pages 448–466. Springer, 2024.
  37. [37] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4566–4575. IEEE Computer Society, 2015.
  38. [38] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In NeurIPS, 2020.
  39. [39] Y. Wang, K. Li, Y. Li, Y. He, B. Huang, Z. Zhao, H. Zhang, J. Xu, Y. Liu, Z. Wang, S. Xing, G. Chen, J. Pan, J. Yu, Y. Wang, L. Wang, and Y. Qiao. InternVideo: General video foundation models via generative and discriminative learning. CoRR, abs/2212.03191, 2022.
  40. [40] Y. Wang, X. Li, Z. Yan, Y. He, J. Yu, X. Zeng, C. Wang, C. Ma, H. Huang, J. Gao, M. Dou, K. Chen, W. Wang, Y. Qiao, Y. Wang, and L. Wang. InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling, 2025.
  41. [41] X. Weng, Y. Wang, Y. Man, and K. M. Kitani. GNN3DMOT: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 6498–6507. IEEE, 2020.
  42. [42] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649, 2017.
  43. [43] H. Xu, G. Ghosh, P. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In EMNLP (1), pages 6787–6800. Association for Computational Linguistics, 2021.
  44. [44] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4):13–es, 2006.
  45. [45] F. Zeng, B. Dong, Y. Zhang, T. Wang, X. Zhang, and Y. Wei. MOTR: End-to-End Multiple-Object Tracking with Transformer. Lecture Notes in Computer Science, pages 659–675, 2022.
  46. [46] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang. ByteTrack: Multi-object Tracking by Associating Every Detection Box. Computer Vision – ECCV 2022, 13682:1–21, 2022.
  47. [47] Y. Zhang, T. Wang, and X. Zhang. MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 22056–22065. IEEE, 2023.
  48. [48] X. Zhou, A. Arnab, C. Sun, and C. Schmid. Dense Video Object Captioning from Disjoint Supervision, 2023.
  49. [49] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.