pith. machine review for the scientific record.

arxiv: 2604.07740 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video person re-identification · CLIP · multi-modal large language models · caption guidance · spatiotemporal features · high-difficulty ReID · SportsVReID · DanceVReID

The pith

The CG-CLIP framework outperforms prior methods for video person re-identification in high-difficulty scenarios by using MLLM captions to refine features and learnable tokens to aggregate spatiotemporal information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video person re-identification matches individuals across non-overlapping camera views by exploiting motion and appearance cues, yet existing approaches break down when multiple people wear similar clothing during fast or complex actions such as sports or dance. The paper introduces CG-CLIP, which adds explicit textual captions generated by multi-modal large language models to highlight identity-specific details and pairs them with fixed-length learnable tokens that efficiently combine spatial and temporal visual features. Two modules carry the work: Caption-guided Memory Refinement improves feature discriminability and Token-based Feature Extraction reduces computational cost while preserving sequence information. A sympathetic reader would care because accurate cross-camera tracking matters for security, event monitoring, and automated analysis whenever visual similarity alone is insufficient.

Core claim

CG-CLIP introduces Caption-guided Memory Refinement (CMR), which refines identity-specific features using captions from multi-modal large language models, and Token-based Feature Extraction (TFE), which applies cross-attention over fixed-length learnable tokens to aggregate spatiotemporal information; together they yield higher matching accuracy than current state-of-the-art methods on the MARS, iLIDS-VID, SportsVReID, and DanceVReID datasets.

What carries the argument

Caption-guided Memory Refinement (CMR) that updates identity features from MLLM text plus Token-based Feature Extraction (TFE) that compresses video sequences via cross-attention on learnable tokens inside a CLIP backbone.
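
For intuition only, here is a minimal sketch of what a TFE-style aggregation step could look like, assuming a hypothetical token count, embedding dimension, and head count (the paper does not report these) and standard PyTorch cross-attention; it is not the authors' implementation.

    # Minimal sketch of cross-attention over fixed-length learnable tokens (TFE-style).
    # The token count, dimension, and head count are hypothetical; the paper does not report them.
    import torch
    import torch.nn as nn

    class TokenAggregator(nn.Module):
        def __init__(self, embed_dim=512, num_tokens=8, num_heads=8):
            super().__init__()
            # Fixed-length learnable tokens act as queries over the whole clip.
            self.tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(embed_dim)

        def forward(self, frame_feats):
            # frame_feats: (batch, frames * patches, embed_dim), flattened spatiotemporal CLIP features.
            b = frame_feats.size(0)
            queries = self.tokens.unsqueeze(0).expand(b, -1, -1)
            # Every learnable token attends to all frame/patch features, so the aggregation
            # cost grows with the fixed token count rather than with clip length.
            pooled, _ = self.cross_attn(queries, frame_feats, frame_feats)
            pooled = self.norm(pooled)
            return pooled.mean(dim=1)  # single clip-level identity descriptor

    # Usage (hypothetical): feats = clip_visual_encoder(frames); descriptor = TokenAggregator()(feats)

The fixed query count and dimension here are exactly the free parameters flagged in the ledger further down.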

If this is right

  • The method records higher accuracy than prior approaches on the standard MARS and iLIDS-VID benchmarks.
  • It delivers larger gains on the new high-difficulty SportsVReID and DanceVReID collections that feature similar outfits and rapid motion.
  • Token-based Feature Extraction lowers computational overhead while still integrating full video sequences.
  • The combined visual-text pipeline succeeds at distinguishing identities when appearance cues alone are ambiguous.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same caption-refinement pattern could transfer to other video tasks such as action recognition or group tracking where text disambiguates visually similar events.
  • Real-time deployment would require lightweight or on-device MLLM captioning to keep latency acceptable for live surveillance.
  • A hybrid mode that falls back to pure visual processing when caption quality is low could widen applicability across varied lighting and camera angles.
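
A minimal sketch of the fallback idea in the last bullet, assuming a hypothetical caption_confidence score and blending weight that the paper does not define:

    # Hypothetical gate; neither the confidence score nor the weights come from the paper.
    def fused_descriptor(visual_feat, caption_feat, caption_confidence, threshold=0.5, alpha=0.3):
        """Fall back to purely visual features when the MLLM caption looks unreliable."""
        if caption_confidence < threshold:
            return visual_feat
        # Otherwise blend the caption-refined and the purely visual descriptors.
        return (1 - alpha) * visual_feat + alpha * caption_feat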

Load-bearing premise

That captions automatically generated by MLLMs reliably capture fine-grained, identity-specific details without introducing noise or bias in dynamic, similar-clothing scenes.

What would settle it

Replace MLLM captions with random or generic text on the SportsVReID or DanceVReID datasets and measure whether rank-1 accuracy and mAP remain statistically unchanged; unchanged performance would falsify the necessity of accurate caption guidance.
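
A sketch of how that test could be scored, assuming a hypothetical extract_features helper and dataset loaders (none of this is the paper's code) and the standard rank-1/mAP definitions:

    # Sketch of the caption-ablation test described above. extract_features(), the dataset
    # loaders, and the caption modes are hypothetical stand-ins, not the paper's code; the
    # metrics follow the standard ReID definitions (same-camera filtering omitted for brevity).
    import numpy as np

    def rank1_and_map(query_feats, gallery_feats, query_ids, gallery_ids):
        """CMC rank-1 and mean average precision; ids are numpy integer arrays."""
        q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
        g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
        dist = 1.0 - q @ g.T                                 # cosine distance matrix

        rank1_hits, average_precisions = [], []
        for i in range(dist.shape[0]):
            order = np.argsort(dist[i])                      # gallery sorted by similarity
            matches = gallery_ids[order] == query_ids[i]     # boolean relevance vector
            rank1_hits.append(float(matches[0]))
            if matches.any():
                hit_ranks = np.where(matches)[0] + 1         # 1-based ranks of true matches
                precisions = np.arange(1, len(hit_ranks) + 1) / hit_ranks
                average_precisions.append(precisions.mean())
        return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

    # Hypothetical ablation loop: same trained model, three caption conditions.
    # for mode in ["mllm", "generic", "random"]:
    #     q_feats, q_ids = extract_features(model, query_set, caption_mode=mode)
    #     g_feats, g_ids = extract_features(model, gallery_set, caption_mode=mode)
    #     print(mode, rank1_and_map(q_feats, g_feats, q_ids, g_ids))
    # If rank-1 and mAP barely move across the three modes, caption fidelity is not doing the work.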

Figures

Figures reproduced from arXiv: 2604.07740 by Sayaka Nakamura, Shogo Hamano, Shunya Wakasugi, Tatsuhito Sato.

Figure 1: Two basketball players with similar appearances. …
Figure 2: Comparison of existing approaches and our method. (a) …
Figure 3: Illustration of our proposed CG-CLIP framework. Caption-guided Memory Refinement (CMR) refines Image Memory based …
Figure 4: Illustration of the proposed Token-based Feature Extraction …
Figure 5: Accuracy–latency trade-off on DanceVReID.
Original abstract

In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CG-CLIP, a caption-guided CLIP framework for video-based person re-identification that addresses high-difficulty scenarios (e.g., sports and dance with similar clothing and dynamic motion). It introduces two components: Caption-guided Memory Refinement (CMR), which uses MLLM-generated textual descriptions to refine identity-specific features, and Token-based Feature Extraction (TFE), which employs cross-attention with fixed-length learnable tokens to aggregate spatiotemporal features. The method is evaluated on MARS, iLIDS-VID, and two newly introduced datasets (SportsVReID and DanceVReID), with the claim that it outperforms current state-of-the-art approaches across all benchmarks.

Significance. If the reported gains are robust and attributable to the caption-guided mechanism, the work could meaningfully extend video ReID beyond pedestrian scenarios by demonstrating the value of explicit multimodal (textual) cues for disambiguating visually similar identities under motion. The introduction of SportsVReID and DanceVReID as challenging benchmarks is a positive contribution that could stimulate further research in this direction.

major comments (3)
  1. [§3.2] §3.2 (CMR description): The central performance claim rests on MLLM captions supplying reliable fine-grained identity cues (pose, gait, accessories) that visual features miss in uniform-clothing dynamic scenes. However, no direct validation of caption fidelity—such as human evaluation scores, error analysis on the new datasets, or ablation comparing CMR with vs. without caption conditioning—is provided. Without this, gains cannot be confidently attributed to the proposed caption-guided refinement rather than TFE or other factors.
  2. [§4.1] §4.1 (Datasets): The construction, sourcing, annotation protocol, and caption-generation procedure for the new SportsVReID and DanceVReID datasets are not described in sufficient detail (e.g., number of identities, video lengths, how MLLM prompts were designed, or inter-annotator agreement). This undermines reproducibility and the claim that these sets specifically test “high-difficulty” cases where captions add value.
  3. [§4.3] §4.3 (Experimental results): The abstract and results section state that the method “outperforms current state-of-the-art approaches, achieving significant improvements,” yet the provided text contains no quantitative numbers, baseline details, standard deviations, or statistical tests. Load-bearing tables (presumably Table 1–3) must be examined for effect sizes and whether improvements hold after controlling for caption quality.
minor comments (2)
  1. [§3.3] Notation for the learnable tokens in TFE (e.g., dimension and count) should be explicitly defined with symbols in §3.3 to avoid ambiguity with CLIP's original token embeddings; one possible notation sketch follows this list.
  2. [§3.2] The paper should clarify whether MLLM caption generation is performed once offline or dynamically, and report any associated computational overhead, as this affects the practicality claim.
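
On the first minor point, one way the token notation could be pinned down (a suggested convention, not the paper's own symbols):

    % Suggested (not the paper's) symbols for the TFE learnable tokens.
    \[
      \mathbf{Q} \in \mathbb{R}^{K \times d}, \qquad
      \mathbf{V} = \mathrm{CLIP}(x_{1:T}) \in \mathbb{R}^{(T \cdot P) \times d}, \qquad
      \mathbf{Z} = \mathrm{CrossAttn}(\mathbf{Q}, \mathbf{V}, \mathbf{V}) \in \mathbb{R}^{K \times d},
    \]
    where $K$ is the fixed token count, $d$ the embedding dimension, $T$ the number of frames,
    and $P$ the number of patch embeddings per frame; $K$ and $d$ are the quantities the
    referee asks to be stated explicitly.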

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will make the corresponding revisions to strengthen the manuscript's clarity, reproducibility, and evidential support.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (CMR description): The central performance claim rests on MLLM captions supplying reliable fine-grained identity cues (pose, gait, accessories) that visual features miss in uniform-clothing dynamic scenes. However, no direct validation of caption fidelity—such as human evaluation scores, error analysis on the new datasets, or ablation comparing CMR with vs. without caption conditioning—is provided. Without this, gains cannot be confidently attributed to the proposed caption-guided refinement rather than TFE or other factors.

    Authors: We agree that direct validation of caption quality would better support attribution of gains specifically to CMR. The current manuscript reports end-to-end results but does not include the requested ablations or fidelity metrics. In the revised version we will add an ablation isolating CMR with versus without caption conditioning, plus qualitative caption examples from SportsVReID and DanceVReID together with a short error analysis of common MLLM failure modes on those sets. revision: yes

  2. Referee: [§4.1] §4.1 (Datasets): The construction, sourcing, annotation protocol, and caption-generation procedure for the new SportsVReID and DanceVReID datasets are not described in sufficient detail (e.g., number of identities, video lengths, how MLLM prompts were designed, or inter-annotator agreement). This undermines reproducibility and the claim that these sets specifically test “high-difficulty” cases where captions add value.

    Authors: We accept that the dataset section lacks sufficient detail for reproducibility. The revised manuscript will expand Section 4.1 with the number of identities and videos per dataset, average and range of video lengths, data sourcing and collection protocol, identity annotation procedure, any inter-annotator agreement statistics, and the exact prompt templates used for MLLM caption generation. These additions will also clarify why the new sets emphasize high-difficulty motion and clothing similarity. revision: yes

  3. Referee: [§4.3] §4.3 (Experimental results): The abstract and results section state that the method “outperforms current state-of-the-art approaches, achieving significant improvements,” yet the provided text contains no quantitative numbers, baseline details, standard deviations, or statistical tests. Load-bearing tables (presumably Table 1–3) must be examined for effect sizes and whether improvements hold after controlling for caption quality.

    Authors: The numerical results, baselines, and comparisons appear in Tables 1–3. To make the claims self-contained, the revised Section 4.3 will explicitly quote the key mAP and Rank-1 figures (with standard deviations where available) in the text, note the magnitude of improvements over the strongest baselines, and briefly discuss that the largest gains occur on the new high-difficulty sets. We will also add a short paragraph addressing the contribution of caption quality via the planned CMR ablation. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical method with no derivation chain or self-referential fitting.

full rationale

The paper presents an applied CV framework (CG-CLIP with CMR and TFE modules) whose central claims rest on experimental outperformance on MARS, iLIDS-VID, SportsVReID and DanceVReID. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation load-bearing arguments appear in the provided text. Caption quality from MLLMs is treated as an external input rather than derived or validated within the paper itself; performance deltas are attributed to the proposed architecture via direct benchmarking. This is a standard empirical contribution with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on pre-trained CLIP and MLLM models treated as fixed oracles plus two new learnable modules whose internal parameters are not quantified in the abstract.

free parameters (1)
  • learnable token count and dimension
    TFE uses fixed-length learnable tokens whose exact size and training details are not specified.
axioms (1)
  • domain assumption: MLLM-generated captions provide accurate, identity-discriminative descriptions even under motion blur and clothing similarity
    Invoked by CMR to refine features from video input.

pith-pipeline@v0.9.0 · 5514 in / 1273 out tokens · 34673 ms · 2026-05-10T18:30:01.521048+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
