Training-Free Semantic Multi-Object Tracking with Vision-Language Models
Pith reviewed 2026-05-10 14:07 UTC · model grok-4.3
The pith
A training-free pipeline composes pretrained vision and language models to track multiple objects and generate semantic descriptions in video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TF-SMOT builds semantic multi-object tracking by first using D-FINE to detect objects and SAM2 to generate temporally consistent mask tracklets, then applying contour grounding with InternVideo2.5 to produce video summaries and per-instance captions, and finally using LLM-based gloss retrieval to map extracted interaction predicates onto BenSMOT WordNet synsets. This composition reaches state-of-the-art tracking performance inside the SMOT setting and yields higher-quality summaries and captions than earlier end-to-end trained systems, while interaction recognition accuracy remains constrained by the fine-grained, long-tailed nature of the label space and by semantic overlap under strict exact-match evaluation.
What carries the argument
The training-free composition of detection, promptable mask tracking, contour-grounded video-language generation, and gloss-based LLM semantic retrieval into one pipeline.
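Read as data flow, the composition is a straight chain from detection to semantic output. The sketch below mirrors the stage order with stub components; all class and function names here are illustrative placeholders, not the paper's code, and each stub stands in for a full pretrained model (D-FINE, SAM2, InternVideo2.5, and an LLM respectively).

```python
from dataclasses import dataclass, field

@dataclass
class Tracklet:
    track_id: int
    masks: list = field(default_factory=list)   # one mask per frame
    caption: str = ""
    interaction_synset: str = ""

def detect(frame):
    """Stub for D-FINE: return (box, class) detections for one frame."""
    return [((10, 10, 50, 50), "person")]

def propagate_masks(frames, detections):
    """Stub for SAM2 promptable tracking: first-frame detections become
    prompts, and memory propagation keeps IDs consistent across frames
    (no learned association step)."""
    tracklets = [Tracklet(track_id=i) for i, _ in enumerate(detections)]
    for t in tracklets:
        t.masks = [f"mask@frame{k}" for k in range(len(frames))]
    return tracklets

def caption_with_contours(frames, tracklet):
    """Stub for contour-grounded InternVideo2.5 captioning."""
    return f"instance {tracklet.track_id} description"

def align_to_synset(caption):
    """Stub for LLM gloss retrieval onto BenSMOT WordNet synsets."""
    return "hold.v.01"

def tf_smot(frames):
    # Stage order matches the pipeline: detect -> track -> caption -> align.
    dets = detect(frames[0])
    tracklets = propagate_masks(frames, dets)
    for t in tracklets:
        t.caption = caption_with_contours(frames, t)
        t.interaction_synset = align_to_synset(t.caption)
    return tracklets
```

Because every stage is a swappable call, replacing any stub with a newer foundation model changes nothing upstream or downstream, which is the modularity claim in a nutshell.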
If this is right
- New foundation models for any of the subtasks can be inserted immediately without retraining the whole system.
- Tracklets remain temporally consistent across frames through mask propagation rather than learned association.
- Video summaries and instance captions are generated directly from contour-grounded visual features.
- Interaction labels are produced by mapping visual predicates to a fixed semantic vocabulary via retrieval and disambiguation.
- Performance on interaction recognition is bounded by the granularity and exact-match requirements of the WordNet label space.
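The gloss-retrieval step in the last two bullets can be sketched minimally: score each candidate synset's gloss against the extracted predicate phrase and keep the best match. The gloss table and the token-overlap scorer below are toy stand-ins (the paper uses the BenSMOT WordNet inventory and an LLM for disambiguation); synset names follow WordNet's `lemma.pos.nn` convention.

```python
from collections import Counter

# Toy gloss table standing in for BenSMOT's WordNet synset inventory;
# glosses are illustrative, not the actual label space.
SYNSET_GLOSSES = {
    "hold.v.01": "have or hold in one's hands or grip",
    "carry.v.01": "move while supporting, either in a vehicle or in one's hands",
    "talk.v.01": "exchange thoughts; talk with",
}

def tokenize(text):
    return Counter(text.lower().split())

def retrieve_synset(predicate_phrase):
    """Rank candidate synsets by token overlap between the extracted
    predicate phrase and each gloss; in the paper's pipeline an LLM
    performs the final disambiguation instead of this crude scorer."""
    query = tokenize(predicate_phrase)
    def score(item):
        _, gloss = item
        return sum((query & tokenize(gloss)).values())
    best, _ = max(SYNSET_GLOSSES.items(), key=score)
    return best

print(retrieve_synset("hold in grip"))  # → hold.v.01
```

Even in this toy form, the failure mode the review flags is visible: near-synonymous glosses (hold vs. carry) produce close scores, so exact-match evaluation penalizes plausible but non-identical synset choices.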
Where Pith is reading between the lines
- The same modular pattern could be applied to other video understanding problems such as long-term action recognition or dynamic scene graph construction by exchanging the language-generation component.
- Real-time deployment on edge devices would require measuring how the latency of the separate pretrained models adds up and whether any step can be approximated without losing tracklet stability.
- If the individual models continue to improve, the performance gap between this training-free method and fully supervised systems is likely to shrink further on descriptive metrics.
Load-bearing premise
Pretrained models for detection, segmentation tracking, video captioning, and language-based predicate alignment can be directly chained without any joint optimization or domain adaptation and still produce consistent tracklets and accurate semantic outputs.
What would settle it
On a new video dataset containing unseen object categories or interaction types, the TF-SMOT pipeline produces fragmented tracklets or captions whose semantic content does not align with human annotations at a rate comparable to trained baselines.
Original abstract
Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TF-SMOT, a training-free pipeline for Semantic Multi-Object Tracking (SMOT) that composes off-the-shelf pretrained components: D-FINE for detection, SAM2 for promptable mask-based video tracking to produce temporally consistent tracklets, contour grounding to condition InternVideo2.5 for video summaries and instance captions, and LLM-based gloss retrieval for aligning interaction predicates to BenSMOT WordNet synsets. On the BenSMOT benchmark, the authors report state-of-the-art tracking performance within the SMOT setting along with improved summary and caption quality relative to prior trained systems, while noting that interaction recognition remains challenging under strict exact-match evaluation due to label granularity and semantic overlap.
Significance. If the empirical claims hold, the work is significant for demonstrating that semantic tracking tasks can be addressed via direct composition of foundation models without task-specific training or fine-tuning, which could accelerate adaptation to new models and interaction vocabularies. The training-free design, use of standard benchmarks, and explicit analysis of factors (semantic overlap, label granularity) affecting interaction performance are strengths that support reproducibility and extensibility.
Major comments (2)
- [§3.2] §3.2 (Tracking Pipeline): The central claim that D-FINE detections can be directly used to prompt SAM2 for temporally consistent tracklets on BenSMOT rests on untested assumptions about ID maintenance under the dataset's motion and occlusion patterns; no ablation isolating SAM2's memory propagation (e.g., with vs. without contour prompts) is provided to confirm this step is load-bearing for the reported SOTA tracking gains.
- [§4.3] §4.3 (Interaction Recognition): The reported challenges under exact-match evaluation are acknowledged, but the paper does not quantify the accuracy of the LLM gloss-retrieval step itself (e.g., precision of synset alignment on a held-out subset); this omission makes it impossible to determine whether the performance gap is due to the training-free composition or to downstream label-space issues.
Minor comments (2)
- [Abstract] Abstract and §4.1: The SOTA tracking claim should explicitly list the primary metrics (MOTA, IDF1, etc.) and the full set of compared SMOT baselines rather than referring only to 'prior art'.
- [Figure 2] Figure 2 and §3.3: The contour-grounding visualization would benefit from an additional panel showing failure cases where grounding leads to incorrect VLM conditioning, to illustrate the limits of the training-free approach.
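The first minor comment asks for the primary tracking metrics by name. MOTA, the CLEAR MOT accuracy metric, has a simple closed form worth stating: it subtracts the normalized sum of misses, false positives, and identity switches from one. A minimal implementation (the count values below are made-up illustrations, not the paper's numbers):

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / total ground-truth
    objects, with all counts summed over frames. Can go negative when
    the error count exceeds the number of ground-truth objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

print(mota(false_negatives=10, false_positives=5, id_switches=2, num_gt=100))  # ≈ 0.83
```

IDF1, the other metric the comment names, is not a per-frame count ratio like MOTA; it is the F1 score of an optimal trajectory-level identity matching, so it rewards long-term ID consistency, which is exactly what SAM2's memory propagation is claimed to provide.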
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. The comments highlight opportunities to strengthen the empirical support for our training-free pipeline. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Tracking Pipeline): The central claim that D-FINE detections can be directly used to prompt SAM2 for temporally consistent tracklets on BenSMOT rests on untested assumptions about ID maintenance under the dataset's motion and occlusion patterns; no ablation isolating SAM2's memory propagation (e.g., with vs. without contour prompts) is provided to confirm this step is load-bearing for the reported SOTA tracking gains.
Authors: We agree that an explicit ablation would more rigorously demonstrate the contribution of SAM2's memory propagation to ID maintenance. While the pipeline design leverages SAM2's promptable tracking for temporal consistency following D-FINE detections, the original submission did not include a direct comparison. In the revised manuscript, we will add an ablation evaluating tracking metrics on BenSMOT with and without contour prompts to SAM2, isolating the effect of its memory mechanism on the reported SOTA performance. revision: yes
-
Referee: [§4.3] §4.3 (Interaction Recognition): The reported challenges under exact-match evaluation are acknowledged, but the paper does not quantify the accuracy of the LLM gloss-retrieval step itself (e.g., precision of synset alignment on a held-out subset); this omission makes it impossible to determine whether the performance gap is due to the training-free composition or to downstream label-space issues.
Authors: We appreciate the suggestion to better isolate error sources in interaction recognition. The manuscript already notes the impact of label granularity and semantic overlap in the WordNet space, but we did not separately measure the LLM gloss-retrieval precision. For the revision, we will evaluate and report the accuracy of the synset alignment step on a held-out subset of BenSMOT annotations. This will clarify the relative contributions of the training-free composition versus inherent label-space challenges. revision: yes
Circularity Check
No circularity; modular composition evaluated on external benchmark
Full rationale
The paper presents TF-SMOT as a training-free pipeline that directly composes off-the-shelf pretrained models (D-FINE for detection, SAM2 for mask-based tracking, InternVideo2.5 for video-language generation, and LLM for gloss retrieval) without any internal parameter fitting, equations, or derivations. Performance claims (SOTA tracking on BenSMOT, improved summaries/captions) rest on empirical evaluation against an external benchmark using standard metrics, with no reduction of outputs to quantities defined by the paper's own inputs. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the described chain; the approach is self-contained as an explicit modular assembly whose validity is tested externally rather than by construction.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Pretrained models D-FINE, SAM2, and InternVideo2.5 produce sufficiently accurate and temporally consistent outputs when used off-the-shelf on the target domain.
- Domain assumption: LLM-based gloss retrieval can reliably align extracted predicates to WordNet synsets despite label granularity.