Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

Haojie Liu; Mingyu Wang; Wanchong Xu; Wei Jiang; Weijie Mao; Zhiyong Li

arxiv: 2604.21324 · v1 · submitted 2026-04-23 · 💻 cs.CV

Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

Zhiyong Li , Wei Jiang , Haojie Liu , Mingyu Wang , Wanchong Xu , Weijie Mao This is my paper

Pith reviewed 2026-05-09 21:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords unsupervised person re-identificationvisible-infrared VI-ReIDvideo trackletstemporal prototypinghierarchical alignmentcontrastive learningcross-modality matching

0 comments

The pith

HiTPro uses hierarchical temporal prototypes to enable unsupervised matching of people across visible and infrared video tracklets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes HiTPro, a prototype-driven approach for learning person identities from unlabeled RGB and infrared video tracklets. It extracts features with a temporal encoder, builds intra-camera prototypes by grouping sub-tracklets, and aligns them hierarchically from same-camera to cross-modality using soft assignments and contrastive learning. This addresses the gap in unsupervised video VI-ReID, which is important for practical all-day surveillance where labeling is costly. If successful, it allows models to discover identity correspondences purely from data structure without annotations.

Core claim

HiTPro constructs reliable intra-camera prototypes via temporal partitioning of tracklets and performs two-stage positive mining in Hierarchical Cross-Prototype Alignment, progressing from within-modality to cross-modality matching with dynamic thresholds and soft weights, then optimizes with hierarchical contrastive learning at intra-camera, cross-camera same-modality, and cross-modality levels, achieving state-of-the-art unsupervised performance on HITSZ-VCM and BUPTCampus.

What carries the argument

Hierarchical Temporal Prototyping, which aggregates features from temporally partitioned sub-tracklets into prototypes and aligns them across cameras and modalities without hard pseudo-labels.

If this is right

Outperforms adapted baselines significantly on two benchmark datasets under fully unsupervised settings.
Establishes a strong baseline for future research in unsupervised video-based VI-ReID.
The temporal-aware feature encoder provides robust tracklet representations that support prototype construction.
The two-stage alignment reduces the impact of modality gaps in positive mining.
Multi-level contrastive learning progressively enforces discrimination and invariance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the prototypes reliably capture identities, similar prototyping could apply to other unlabeled multi-modal video tasks.
The dynamic threshold strategy might generalize to adaptively handle varying data qualities in re-identification.
Success here suggests that avoiding hard labels can mitigate error propagation in unsupervised cross-modal learning.
Future work could test integration with other temporal aggregation techniques for even better robustness.

Load-bearing premise

The constructed intra-camera prototypes and the hierarchical alignment process accurately identify true cross-modality identity matches from unlabeled tracklets without systematic errors from incorrect positives or large modality differences.

What would settle it

Running HiTPro on a dataset where ground-truth cross-modality correspondences are known and observing whether the learned features achieve high matching accuracy or if many wrong identities are aligned due to prototype errors.

Figures

Figures reproduced from arXiv: 2604.21324 by Haojie Liu, Mingyu Wang, Wanchong Xu, Wei Jiang, Weijie Mao, Zhiyong Li.

**Figure 2.** Figure 2: Illustration of the proposed HiTPro framework. The input video tracklets from visible and infrared modalities are processed by a Temporal-aware [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture of the Temporal-aware Feature Encoder (TFE). Input [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Influence of K on (a) HITSZ-VCM dataset and (b) BUPTCAMPUS [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Influence of K on (a) HITSZ-VCM dataset and (b) BUPTCAMPUS [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Cosine distance distributions of 40,000 randomly sampled positive [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: t-SNE visualization of tracklet features for 20 randomly selected identities from the test set. Each color represents a unique identity; circles denote [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: Top-5 Retrieval results of (a) Baseline and (b) HiTPro method on HITSZ-VCM dataset. Correct matches are highlighted with green boxes, while [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

read the original abstract

Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework without explicit hard pseudo-label assignment for unsupervised video-based VI-ReID. HiTPro begins with an efficient Temporal-aware Feature Encoder that first extracts discriminative frame-level features and then aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro first constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping by aggregating features from temporally partitioned sub-tracklets. Through Hierarchical Cross-Prototype Alignment, we perform a two-stage positive mining process: progressing from within-modality associations to cross-modality matching, enhanced by Dynamic Threshold Strategy and Soft Weight Assignment. Finally, {Hierarchical Contrastive Learning} progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishes a strong baseline for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HiTPro carves out unsupervised video VI-ReID with a staged prototype pipeline, but the cross-modal mining step lacks direct checks for false positives.

read the letter

The paper's main contribution is defining the unsupervised video-based visible-infrared re-identification setting and giving it a concrete pipeline. HiTPro encodes tracklets temporally, builds intra-camera prototypes by splitting sub-tracklets, then runs two-stage positive mining (within-modality first, then cross-modality) with a dynamic threshold and soft weights before applying three-level contrastive losses. That staged structure and the avoidance of hard pseudo-labels are the clearest novelties; prior work stayed supervised or image-only, so this opens a practical direction for label-free all-day surveillance systems.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HiTPro, a prototype-driven framework for unsupervised video-based visible-infrared person re-identification (VI-ReID). It consists of a Temporal-aware Feature Encoder that extracts frame-level features and aggregates them into tracklet representations, Intra-Camera Tracklet Prototyping to build reliable per-camera prototypes from temporally partitioned sub-tracklets, Hierarchical Cross-Prototype Alignment that performs two-stage positive mining (within-modality then cross-modality) using a Dynamic Threshold Strategy and Soft Weight Assignment, and Hierarchical Contrastive Learning that optimizes alignment at intra-camera, cross-camera same-modality, and cross-modality levels. Experiments on the HITSZ-VCM and BUPTCampus datasets report state-of-the-art unsupervised performance, significantly outperforming adapted baselines.

Significance. If the central empirical claims hold, the work is significant because it addresses the largely unexplored problem of fully unsupervised video VI-ReID, which is practically important for label-free all-day surveillance systems. The avoidance of hard pseudo-label assignment via soft prototype alignment and the explicit handling of temporal dynamics and modality gaps represent a promising technical direction. The paper establishes a reproducible baseline that future unsupervised video VI-ReID methods can build upon.

major comments (2)

[Section 3.3] Section 3.3 (Hierarchical Cross-Prototype Alignment): the two-stage positive mining (within-modality followed by cross-modality) with Dynamic Threshold Strategy and Soft Weight Assignment is load-bearing for the identity-discovery claim, yet the manuscript provides no direct measurement of mining precision/recall or ablation on error propagation from residual modality gaps after the Temporal-aware Feature Encoder; without these, it is unclear whether the reported SOTA gains on HITSZ-VCM and BUPTCampus arise from correct correspondences or from other factors such as stronger tracklet aggregation.
[Section 4] Section 4 (Experiments): the central claim of reliable unsupervised identity discovery rests on the Intra-Camera Tracklet Prototyping producing clean clusters, but no quantitative cluster-purity metrics, sensitivity analysis on the dynamic threshold, or ablation removing the hierarchical alignment step are presented; this leaves open the possibility that performance improvements are not attributable to the proposed alignment mechanism.

minor comments (2)

[Section 4.1] The description of baseline adaptations (e.g., how supervised VI-ReID methods were converted to unsupervised settings) is only summarized; explicit implementation details or code references would improve reproducibility.
[Section 3.4] Notation for the soft weight assignment and the three-level contrastive losses could be clarified with a single summary equation or table to avoid ambiguity when reading the hierarchical objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of empirical validation for our unsupervised video VI-ReID framework. We address each major comment below and will incorporate the suggested analyses in the revised manuscript to more clearly attribute performance gains to the proposed components.

read point-by-point responses

Referee: [Section 3.3] Section 3.3 (Hierarchical Cross-Prototype Alignment): the two-stage positive mining (within-modality followed by cross-modality) with Dynamic Threshold Strategy and Soft Weight Assignment is load-bearing for the identity-discovery claim, yet the manuscript provides no direct measurement of mining precision/recall or ablation on error propagation from residual modality gaps after the Temporal-aware Feature Encoder; without these, it is unclear whether the reported SOTA gains on HITSZ-VCM and BUPTCampus arise from correct correspondences or from other factors such as stronger tracklet aggregation.

Authors: We agree that direct measurements of mining precision and recall, along with targeted ablations on residual modality gaps, would strengthen the evidence that the two-stage positive mining drives the identity discovery. In the revised version, we will add a quantitative evaluation of mining accuracy (precision/recall) by using ground-truth identities from the evaluation sets for post-hoc verification of the mined positives, while preserving the fully unsupervised training protocol. We will also include an ablation that isolates the effect of modality gaps after the Temporal-aware Feature Encoder (e.g., by comparing the full two-stage process against a single-stage cross-modality baseline). These additions will demonstrate that the SOTA gains on HITSZ-VCM and BUPTCampus arise from the hierarchical alignment mechanism rather than tracklet aggregation alone. revision: yes
Referee: [Section 4] Section 4 (Experiments): the central claim of reliable unsupervised identity discovery rests on the Intra-Camera Tracklet Prototyping producing clean clusters, but no quantitative cluster-purity metrics, sensitivity analysis on the dynamic threshold, or ablation removing the hierarchical alignment step are presented; this leaves open the possibility that performance improvements are not attributable to the proposed alignment mechanism.

Authors: We acknowledge that cluster-purity metrics, sensitivity analysis, and a dedicated ablation removing the hierarchical alignment would provide clearer attribution of gains to the Intra-Camera Tracklet Prototyping and subsequent alignment steps. In the revision, we will report quantitative cluster purity metrics (e.g., purity score and normalized mutual information) for the prototypes generated from temporally partitioned sub-tracklets. We will also add a sensitivity analysis varying the dynamic threshold parameter across a range of values and report its impact on final Re-ID performance. Finally, we will include an ablation that removes the hierarchical cross-prototype alignment (retaining only intra-camera prototyping and basic contrastive learning) to isolate its contribution. These experiments will be conducted on both HITSZ-VCM and BUPTCampus to confirm that the reported improvements stem from the full proposed pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with independent algorithmic content

full rationale

The paper describes an algorithmic pipeline (Temporal-aware Feature Encoder, Intra-Camera Tracklet Prototyping, two-stage Hierarchical Cross-Prototype Alignment with Dynamic Threshold and Soft Weight Assignment, and Hierarchical Contrastive Learning) that generates its own pseudo-supervision from unlabeled tracklets. These steps constitute a standard self-supervised clustering-plus-contrastive loop rather than a mathematical derivation. No equations are shown that reduce claimed performance or identity correspondences to fitted parameters by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Validation is purely empirical on separate test sets (HITSZ-VCM, BUPTCampus), leaving the central claims falsifiable and non-tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The framework rests on several domain assumptions about feature discriminability and prototype reliability that are not independently verified in the provided abstract.

free parameters (2)

Dynamic Threshold
Used in positive mining process; value likely chosen or tuned per dataset.
Soft Weight Assignment parameters
Controls contribution of mined positives; specific values not stated.

axioms (2)

domain assumption Temporal-aware Feature Encoder produces sufficiently discriminative frame-level features for subsequent aggregation.
Invoked at the start of the pipeline to enable tracklet-level representations.
domain assumption Intra-camera sub-tracklet aggregation yields reliable prototypes without identity supervision.
Central to the first stage of prototype construction.

invented entities (2)

Intra-Camera Tracklet Prototypes no independent evidence
purpose: Serve as anchors for hierarchical alignment and contrastive learning.
Newly introduced construct for aggregating temporally partitioned features.
Hierarchical Cross-Prototype Alignment no independent evidence
purpose: Two-stage positive mining from within-modality to cross-modality.
Core novel mechanism for unsupervised cross-modal matching.

pith-pipeline@v0.9.0 · 5607 in / 1464 out tokens · 39417 ms · 2026-05-09T21:57:23.701876+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages

[1]

Deep learning for visible-infrared cross-modality person re-identification: A comprehen- sive review,

N. Huang, J. Liu, Y . Miao, Q. Zhang, and J. Han, “Deep learning for visible-infrared cross-modality person re-identification: A comprehen- sive review,”Information Fusion, 2022

work page 2022
[2]

Deep learning for person re-identification: A survey and outlook,

M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 6, pp. 2872– 2893, 2021

work page 2021
[3]

A survey of open-world person re- identification,

Q. Leng, M. Ye, and Q. Tian, “A survey of open-world person re- identification,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 1092–1108, 2019

work page 2019
[4]

Rgb-infrared cross- modality person re-identification,

A. Wu, W.-S. Zheng, H.-X. Yu, S. Gong, and J. Lai, “Rgb-infrared cross- modality person re-identification,” inProceedings of the IEEE/CVF international conference on computer vision, 2017, pp. 5390–5399

work page 2017
[5]

Learning modality-specific representations for visible-infrared person re-identification,

Z. Feng, J. Lai, and X. Xie, “Learning modality-specific representations for visible-infrared person re-identification,”IEEE Transactions on Im- age Processing, vol. 29, pp. 579–590, 2019

work page 2019
[6]

Learning progressive modality-shared transformers for effective visible-infrared person re-identification,

H. Lu, X. Zou, and P. Zhang, “Learning progressive modality-shared transformers for effective visible-infrared person re-identification,” in Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 2, 2023, pp. 1835–1843

work page 2023
[7]

Cross- modality person re-identification with shared-specific feature transfer,

Y . Lu, Y . Wu, B. Liu, T. Zhang, B. Li, Q. Chu, and N. Yu, “Cross- modality person re-identification with shared-specific feature transfer,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13 379–13 389

work page 2020
[8]

Grayscale enhancement colorization network for visible-infrared person re-identification,

X. Zhong, T. Lu, W. Huang, M. Ye, X. Jia, and C.-W. Lin, “Grayscale enhancement colorization network for visible-infrared person re-identification,”IEEE transactions on circuits and systems for video technology, vol. 32, no. 3, pp. 1418–1430, 2021. 13

work page 2021
[9]

Cross-modality person re- identification via channel-based partition network,

J. Liu, W. Song, C. Chen, and F. Liu, “Cross-modality person re- identification via channel-based partition network,”Applied intelligence, vol. 52, no. 3, pp. 2423–2435, 2022

work page 2022
[10]

A generative-based image fusion strategy for visible-infrared person re-identification,

J. Qi, T. Liang, W. Liu, Y . Li, and Y . Jin, “A generative-based image fusion strategy for visible-infrared person re-identification,”IEEE Trans- actions on Circuits and Systems for Video Technology, vol. 34, no. 1, pp. 518–533, 2023

work page 2023
[11]

Video- based person re-identification with accumulative motion context,

H. Liu, Z. Jie, K. Jayashree, M. Qi, J. Jiang, S. Yan, and J. Feng, “Video- based person re-identification with accumulative motion context,”IEEE transactions on circuits and systems for video technology, vol. 28, no. 10, pp. 2788–2802, 2017

work page 2017
[12]

Multi-level factorisation net for person re-identification,

X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level factorisation net for person re-identification,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2109–2118

work page 2018
[13]

Learning modal-invariant and temporal-memory for video-based visible-infrared person re-identification,

X. Lin, J. Li, Z. Ma, H. Li, S. Li, K. Xu, G. Lu, and D. Zhang, “Learning modal-invariant and temporal-memory for video-based visible-infrared person re-identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20 973–20 982

work page 2022
[14]

Video-based visible-infrared person re-identification with auxiliary samples,

Y . Du, C. Lei, Z. Zhao, Y . Dong, and F. Su, “Video-based visible-infrared person re-identification with auxiliary samples,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 1313–1325, 2023

work page 2023
[15]

Multi-memory matching for unsupervised visible-infrared person re- identification,

J. Shi, X. Yin, Y . Chen, Y . Zhang, Z. Zhang, Y . Xie, and Y . Qu, “Multi-memory matching for unsupervised visible-infrared person re- identification,” inEuropean Conference on Computer Vision, 2024, pp. 456–474

work page 2024
[16]

Dual consistency-constrained learning for unsupervised visible-infrared person re-identification,

B. Yang, J. Chen, C. Chen, and M. Ye, “Dual consistency-constrained learning for unsupervised visible-infrared person re-identification,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 1767– 1779, 2023

work page 2023
[17]

Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,

B. Yang, M. Ye, J. Chen, and Z. Wu, “Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,” inProceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2843–2851

work page 2022
[18]

A density-based algorithm for discovering clusters in large spatial databases with noise,

M. Ester, H.-P. Kriegel, J. Sander, X. Xuet al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” inkdd, 1996, pp. 226–231

work page 1996
[19]

Unsupervised tracklet person re- identification,

M. Li, X. Zhu, and S. Gong, “Unsupervised tracklet person re- identification,”IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 7, pp. 1770–1782, 2019

work page 2019
[20]

Hierarchical discriminative learning for visible thermal person re-identification,

M. Ye, X. Lan, J. Li, and P. C. Yuen, “Hierarchical discriminative learning for visible thermal person re-identification,” inProceedings of the AAAI conference on artificial intelligence, 2018, pp. 7501–7508

work page 2018
[21]

Cross-modality person re- identification via modality-aware collaborative ensemble learning,

M. Ye, X. Lan, Q. Leng, and J. Shen, “Cross-modality person re- identification via modality-aware collaborative ensemble learning,”IEEE Transactions on Image Processing, vol. 29, pp. 9387–9399, 2020

work page 2020
[22]

Structure-aware positional transformer for visible-infrared person re-identification,

C. Chen, M. Ye, M. Qi, J. Wu, J. Jiang, and C.-W. Lin, “Structure-aware positional transformer for visible-infrared person re-identification,”IEEE Transactions on Image Processing, vol. 31, pp. 2352–2364, 2022

work page 2022
[23]

Dual-stream transformer with distribution alignment for visible-infrared person re- identification,

Z. Chai, Y . Ling, Z. Luo, D. Lin, M. Jiang, and S. Li, “Dual-stream transformer with distribution alignment for visible-infrared person re- identification,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 11, pp. 6764–6776, 2023

work page 2023
[24]

Cross- modality transformer for visible-infrared person re-identification,

K. Jiang, T. Zhang, X. Liu, B. Qian, Y . Zhang, and F. Wu, “Cross- modality transformer for visible-infrared person re-identification,” in European conference on computer vision, 2022, pp. 480–496

work page 2022
[25]

Towards homogeneous modality learning and multi-granularity information exploration for visible-infrared person re-identification,

H. Liu, D. Xia, and W. Jiang, “Towards homogeneous modality learning and multi-granularity information exploration for visible-infrared person re-identification,”IEEE Journal of selected topics in signal processing, vol. 17, no. 3, pp. 545–559, 2023

work page 2023
[26]

Fmcnet: Feature-level modality compensation for visible-infrared person re-identification,

Q. Zhang, C. Lai, J. Liu, N. Huang, and J. Han, “Fmcnet: Feature-level modality compensation for visible-infrared person re-identification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 7349–7358

work page 2022
[27]

Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re- identification,

Y . Zhang and H. Wang, “Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re- identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 2153–2162

work page 2023
[28]

Stochastic style perturbation modelling for visible-infrared person re-identification with severely modality imbalance,

H. Liu, Z. Li, J. Gu, M. Wang, Q. J. Wu, and W. Jiang, “Stochastic style perturbation modelling for visible-infrared person re-identification with severely modality imbalance,”Neural Networks, p. 108206, 2025

work page 2025
[29]

Frequency domain nuances mining for visible-infrared person re-identification,

Y . Zhang, H. Wang, Y . Lu, Y . Yan, and X. Li, “Frequency domain nuances mining for visible-infrared person re-identification,”IEEE Transactions on Information Forensics and Security, 2025

work page 2025
[30]

Discovering multi- frequency embedding for visible-infrared person re-identification,

H. Gu, X. Yang, R. Lu, L. Pu, S. Han, and M. Wu, “Discovering multi- frequency embedding for visible-infrared person re-identification,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025
[31]

Wavelet-based frequency feature learning for visible-infrared person re-identification,

T. Yu, D. Cheng, H. Jiang, L. Chen, J. Qian, Q. Kou, and G. Zhai, “Wavelet-based frequency feature learning for visible-infrared person re-identification,”IEEE Transactions on Consumer Electronics, 2026

work page 2026
[32]

Homogeneous-to- heterogeneous: Unsupervised learning for rgb-infrared person re- identification,

W. Liang, G. Wang, J. Lai, and X. Xie, “Homogeneous-to- heterogeneous: Unsupervised learning for rgb-infrared person re- identification,”IEEE Transactions on Image Processing, vol. 30, pp. 6392–6407, 2021

work page 2021
[33]

Optimal transport for label-efficient visible-infrared person re-identification,

J. Wang, Z. Zhang, M. Chen, Y . Zhang, C. Wang, B. Sheng, Y . Qu, and Y . Xie, “Optimal transport for label-efficient visible-infrared person re-identification,” inEuropean conference on computer vision, 2022, pp. 93–109

work page 2022
[34]

Unsupervised visible-infrared person re-identification via progressive graph matching and alternate learning,

Z. Wu and M. Ye, “Unsupervised visible-infrared person re-identification via progressive graph matching and alternate learning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 9548–9558

work page 2023
[35]

Cross-modality hierarchical clustering and refinement for unsupervised visible-infrared person re-identification,

Z. Pang, C. Wang, L. Zhao, Y . Liu, and G. Sharma, “Cross-modality hierarchical clustering and refinement for unsupervised visible-infrared person re-identification,”IEEE Transactions on circuits and systems for video technology, vol. 34, no. 4, pp. 2706–2718, 2023

work page 2023
[36]

Inter-intra modality knowledge learning and clustering noise alleviation for unsupervised visible-infrared person re-identification,

Z. Li, H. Liu, X. Peng, and W. Jiang, “Inter-intra modality knowledge learning and clustering noise alleviation for unsupervised visible-infrared person re-identification,”Transactions on Knowledge and Data Engi- neering, vol. 36, no. 8, pp. 3934–3947, 2024

work page 2024
[37]

Shallow-deep collaborative learning for unsupervised visible-infrared person re-identification,

B. Yang, J. Chen, and M. Ye, “Shallow-deep collaborative learning for unsupervised visible-infrared person re-identification,” inProceedings of the IEEE/CVF international conference on computer vision, 2024, pp. 16 870–16 879

work page 2024
[38]

Adaptive pseudo-label purification and debiasing for unsupervised visible-infrared person re- identification,

X. Yin, J. Shi, Z. Zhang, Y . Xie, and Y . Qu, “Adaptive pseudo-label purification and debiasing for unsupervised visible-infrared person re- identification,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025
[39]

Recurrent convolu- tional network for video-based person re-identification,

N. McLaughlin, J. M. Del Rincon, and P. Miller, “Recurrent convolu- tional network for video-based person re-identification,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1325–1334

work page 2016
[40]

See the forest for the trees: Joint spatial and temporal recurrent neural networks for video- based person re-identification,

Z. Zhou, Y . Huang, W. Wang, L. Wang, and T. Tan, “See the forest for the trees: Joint spatial and temporal recurrent neural networks for video- based person re-identification,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4747–4756

work page 2017
[41]

Appearance- preserving 3d convolution for video-based person re-identification,

X. Gu, H. Chang, B. Ma, H. Zhang, and X. Chen, “Appearance- preserving 3d convolution for video-based person re-identification,” in European conference on computer vision. Springer, 2020, pp. 228–243

work page 2020
[42]

Multi-scale 3d convolution network for video based person re-identification,

J. Li, S. Zhang, and T. Huang, “Multi-scale 3d convolution network for video based person re-identification,” inProceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8618– 8625

work page 2019
[43]

Pyramid spatial-temporal aggregation for video-based person re-identification,

Y . Wang, P. Zhang, S. Gao, X. Geng, H. Lu, and D. Wang, “Pyramid spatial-temporal aggregation for video-based person re-identification,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 026–12 035

work page 2021
[44]

Temporal coherence or temporal motion: Which is more critical for video-based person re-identification?

G. Chen, Y . Rao, J. Lu, and J. Zhou, “Temporal coherence or temporal motion: Which is more critical for video-based person re-identification?” inEuropean conference on computer vision. Springer, 2020, pp. 660– 676

work page 2020
[45]

Temporal complemen- tary learning for video person re-identification,

R. Hou, H. Chang, B. Ma, S. Shan, and X. Chen, “Temporal complemen- tary learning for video person re-identification,” inEuropean conference on computer vision. Springer, 2020, pp. 388–405

work page 2020
[46]

Spatial- temporal graph convolutional network for video-based person re- identification,

J. Yang, W.-S. Zheng, Q. Yang, Y .-C. Chen, and Q. Tian, “Spatial- temporal graph convolutional network for video-based person re- identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3289–3299

work page 2020
[47]

Sta: Spatial-temporal attention for large-scale video-based person re-identification,

Y . Fu, X. Wang, Y . Wei, and T. Huang, “Sta: Spatial-temporal attention for large-scale video-based person re-identification,” inProceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8287–8294

work page 2019
[48]

Multiscale aligned spatial–temporal interaction for video-based person re-identification,

Z. Ran, X. Wei, W. Liu, and X. Lu, “Multiscale aligned spatial–temporal interaction for video-based person re-identification,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 9, pp. 8536– 8546, 2024

work page 2024
[49]

Spatial and temporal mutual promotion for video-based person re-identification,

Y . Liu, Z. Yuan, W. Zhou, and H. Li, “Spatial and temporal mutual promotion for video-based person re-identification,” inProceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8786–8793

work page 2019
[50]

Recursively learning fine-grained spatial–temporal features for video-based person re-identification,

H. Ma, C. Zhang, Z. Li, and Z. Wang, “Recursively learning fine-grained spatial–temporal features for video-based person re-identification,”En- gineering Applications of Artificial Intelligence, vol. 148, p. 110429, 2025. 14

work page 2025
[51]

Unsupervised person re-identification by deep learning tracklet association,

M. Li, X. Zhu, and S. Gong, “Unsupervised person re-identification by deep learning tracklet association,” inProceedings of the European conference on computer vision (ECCV), 2018, pp. 737–753

work page 2018
[52]

Unsupervised video person re-identification via noise and hard frame aware clustering,

P. Xie, X. Xu, Z. Wang, and T. Yamasaki, “Unsupervised video person re-identification via noise and hard frame aware clustering,”arXiv preprint arXiv:2106.05441, 2021

work page arXiv 2021
[53]

Successive consensus clustering for unsupervised video-based person re-identification,

J. Qian and X. Xie, “Successive consensus clustering for unsupervised video-based person re-identification,”IEEE Signal Processing Letters, vol. 29, pp. 822–826, 2022

work page 2022
[54]

Sampling and re- weighting: Towards diverse frame aware unsupervised video person re- identification,

P. Xie, X. Xu, Z. Wang, and T. Yamasaki, “Sampling and re- weighting: Towards diverse frame aware unsupervised video person re- identification,”IEEE Transactions on Multimedia, vol. 24, pp. 4250– 4261, 2022

work page 2022
[55]

Dual represen- tation modeling and progressive contrastive learning for unsupervised video person re-identification,

C. Zhang, Y . Su, N. Wang, Y . Lan, T. Wang, and A. Li, “Dual represen- tation modeling and progressive contrastive learning for unsupervised video person re-identification,”Neurocomputing, vol. 645, p. 130467, 2025

work page 2025
[56]

Intermediary-guided bidi- rectional spatial–temporal aggregation network for video-based visible- infrared person re-identification,

H. Li, M. Liu, Z. Hu, F. Nie, and Z. Yu, “Intermediary-guided bidi- rectional spatial–temporal aggregation network for video-based visible- infrared person re-identification,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 9, pp. 4962–4972, 2023

work page 2023
[57]

Fa-net: A feature alignment network for video-based visible-infrared person re- identification,

X. Yang, W. Dong, X. Wang, D. Cheng, and N. Wang, “Fa-net: A feature alignment network for video-based visible-infrared person re- identification,”IEEE Transactions on Image Processing, vol. 34, pp. 8406–8420, 2025

work page 2025
[58]

Spatio-temporal information mining and fusion feature-guided modal alignment for video-based visible-infrared person re-identification,

Z. Zuo, H. Li, Y . Zhang, and M. Xie, “Spatio-temporal information mining and fusion feature-guided modal alignment for video-based visible-infrared person re-identification,”Image and Vision Computing, vol. 157, p. 105518, 2025

work page 2025
[59]

Spatial-temporal high-frequency learning for video-based visible-infrared person re- identification,

S. Tao, S. Li, J. Ye, N. Dong, F. Li, and H. Li, “Spatial-temporal high-frequency learning for video-based visible-infrared person re- identification,”IEEE Transactions on Circuits and Systems for Video Technology, 2026

work page 2026
[60]

Learning by aligning: Visible- infrared person re-identification using cross-modal correspondences,

H. Park, S. Lee, J. Lee, and B. Ham, “Learning by aligning: Visible- infrared person re-identification using cross-modal correspondences,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 046–12 055

work page 2021
[61]

Dynamic dual-attentive aggregation learning for visible-infrared person re- identification,

M. Ye, J. Shen, D. J. Crandall, L. Shao, and J. Luo, “Dynamic dual-attentive aggregation learning for visible-infrared person re- identification,” inEuropean conference on computer vision, 2020, pp. 229–247

work page 2020
[62]

Channel augmented joint learning for visible-infrared recognition,

M. Ye, W. Ruan, B. Du, and M. Z. Shou, “Channel augmented joint learning for visible-infrared recognition,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 567–13 576

work page 2021
[63]

Adversarial self-attack defense and spatial-temporal relation mining for visible-infrared video person re-identification,

L. Xu, H. Li, Y . Zhang, and D. Tao, “Adversarial self-attack defense and spatial-temporal relation mining for visible-infrared video person re-identification,”International Journal of Machine Learning and Cy- bernetics, vol. 16, no. 10, pp. 7843–7858, 2025

work page 2025
[64]

Shape-centered repre- sentation learning for visible–infrared person re-identification,

S. Li, J. Leng, J. Gan, M. Mo, and X. Gao, “Shape-centered repre- sentation learning for visible–infrared person re-identification,”Pattern Recognition, vol. 167, p. 111756, 2025

work page 2025
[65]

Cluster contrast for unsupervised person re-identification,

Z. Dai, G. Wang, W. Yuan, S. Zhu, and P. Tan, “Cluster contrast for unsupervised person re-identification,” inProceedings of the Asian conference on computer vision, 2022, pp. 1142–1160

work page 2022
[66]

Cross-camera discriminative person association by unsupervised frame clustering and selection,

Q. Li, M. Gao, G. Zhang, W. Zhai, and G. Jeon, “Cross-camera discriminative person association by unsupervised frame clustering and selection,”IEEE Internet of Things Journal, 2025

work page 2025

[1] [1]

Deep learning for visible-infrared cross-modality person re-identification: A comprehen- sive review,

N. Huang, J. Liu, Y . Miao, Q. Zhang, and J. Han, “Deep learning for visible-infrared cross-modality person re-identification: A comprehen- sive review,”Information Fusion, 2022

work page 2022

[2] [2]

Deep learning for person re-identification: A survey and outlook,

M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 6, pp. 2872– 2893, 2021

work page 2021

[3] [3]

A survey of open-world person re- identification,

Q. Leng, M. Ye, and Q. Tian, “A survey of open-world person re- identification,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 1092–1108, 2019

work page 2019

[4] [4]

Rgb-infrared cross- modality person re-identification,

A. Wu, W.-S. Zheng, H.-X. Yu, S. Gong, and J. Lai, “Rgb-infrared cross- modality person re-identification,” inProceedings of the IEEE/CVF international conference on computer vision, 2017, pp. 5390–5399

work page 2017

[5] [5]

Learning modality-specific representations for visible-infrared person re-identification,

Z. Feng, J. Lai, and X. Xie, “Learning modality-specific representations for visible-infrared person re-identification,”IEEE Transactions on Im- age Processing, vol. 29, pp. 579–590, 2019

work page 2019

[6] [6]

Learning progressive modality-shared transformers for effective visible-infrared person re-identification,

H. Lu, X. Zou, and P. Zhang, “Learning progressive modality-shared transformers for effective visible-infrared person re-identification,” in Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 2, 2023, pp. 1835–1843

work page 2023

[7] [7]

Cross- modality person re-identification with shared-specific feature transfer,

Y . Lu, Y . Wu, B. Liu, T. Zhang, B. Li, Q. Chu, and N. Yu, “Cross- modality person re-identification with shared-specific feature transfer,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13 379–13 389

work page 2020

[8] [8]

Grayscale enhancement colorization network for visible-infrared person re-identification,

X. Zhong, T. Lu, W. Huang, M. Ye, X. Jia, and C.-W. Lin, “Grayscale enhancement colorization network for visible-infrared person re-identification,”IEEE transactions on circuits and systems for video technology, vol. 32, no. 3, pp. 1418–1430, 2021. 13

work page 2021

[9] [9]

Cross-modality person re- identification via channel-based partition network,

J. Liu, W. Song, C. Chen, and F. Liu, “Cross-modality person re- identification via channel-based partition network,”Applied intelligence, vol. 52, no. 3, pp. 2423–2435, 2022

work page 2022

[10] [10]

A generative-based image fusion strategy for visible-infrared person re-identification,

J. Qi, T. Liang, W. Liu, Y . Li, and Y . Jin, “A generative-based image fusion strategy for visible-infrared person re-identification,”IEEE Trans- actions on Circuits and Systems for Video Technology, vol. 34, no. 1, pp. 518–533, 2023

work page 2023

[11] [11]

Video- based person re-identification with accumulative motion context,

H. Liu, Z. Jie, K. Jayashree, M. Qi, J. Jiang, S. Yan, and J. Feng, “Video- based person re-identification with accumulative motion context,”IEEE transactions on circuits and systems for video technology, vol. 28, no. 10, pp. 2788–2802, 2017

work page 2017

[12] [12]

Multi-level factorisation net for person re-identification,

X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level factorisation net for person re-identification,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2109–2118

work page 2018

[13] [13]

Learning modal-invariant and temporal-memory for video-based visible-infrared person re-identification,

X. Lin, J. Li, Z. Ma, H. Li, S. Li, K. Xu, G. Lu, and D. Zhang, “Learning modal-invariant and temporal-memory for video-based visible-infrared person re-identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20 973–20 982

work page 2022

[14] [14]

Video-based visible-infrared person re-identification with auxiliary samples,

Y . Du, C. Lei, Z. Zhao, Y . Dong, and F. Su, “Video-based visible-infrared person re-identification with auxiliary samples,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 1313–1325, 2023

work page 2023

[15] [15]

Multi-memory matching for unsupervised visible-infrared person re- identification,

J. Shi, X. Yin, Y . Chen, Y . Zhang, Z. Zhang, Y . Xie, and Y . Qu, “Multi-memory matching for unsupervised visible-infrared person re- identification,” inEuropean Conference on Computer Vision, 2024, pp. 456–474

work page 2024

[16] [16]

Dual consistency-constrained learning for unsupervised visible-infrared person re-identification,

B. Yang, J. Chen, C. Chen, and M. Ye, “Dual consistency-constrained learning for unsupervised visible-infrared person re-identification,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 1767– 1779, 2023

work page 2023

[17] [17]

Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,

B. Yang, M. Ye, J. Chen, and Z. Wu, “Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,” inProceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2843–2851

work page 2022

[18] [18]

A density-based algorithm for discovering clusters in large spatial databases with noise,

M. Ester, H.-P. Kriegel, J. Sander, X. Xuet al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” inkdd, 1996, pp. 226–231

work page 1996

[19] [19]

Unsupervised tracklet person re- identification,

M. Li, X. Zhu, and S. Gong, “Unsupervised tracklet person re- identification,”IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 7, pp. 1770–1782, 2019

work page 2019

[20] [20]

Hierarchical discriminative learning for visible thermal person re-identification,

M. Ye, X. Lan, J. Li, and P. C. Yuen, “Hierarchical discriminative learning for visible thermal person re-identification,” inProceedings of the AAAI conference on artificial intelligence, 2018, pp. 7501–7508

work page 2018

[21] [21]

Cross-modality person re- identification via modality-aware collaborative ensemble learning,

M. Ye, X. Lan, Q. Leng, and J. Shen, “Cross-modality person re- identification via modality-aware collaborative ensemble learning,”IEEE Transactions on Image Processing, vol. 29, pp. 9387–9399, 2020

work page 2020

[22] [22]

Structure-aware positional transformer for visible-infrared person re-identification,

C. Chen, M. Ye, M. Qi, J. Wu, J. Jiang, and C.-W. Lin, “Structure-aware positional transformer for visible-infrared person re-identification,”IEEE Transactions on Image Processing, vol. 31, pp. 2352–2364, 2022

work page 2022

[23] [23]

Dual-stream transformer with distribution alignment for visible-infrared person re- identification,

Z. Chai, Y . Ling, Z. Luo, D. Lin, M. Jiang, and S. Li, “Dual-stream transformer with distribution alignment for visible-infrared person re- identification,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 11, pp. 6764–6776, 2023

work page 2023

[24] [24]

Cross- modality transformer for visible-infrared person re-identification,

K. Jiang, T. Zhang, X. Liu, B. Qian, Y . Zhang, and F. Wu, “Cross- modality transformer for visible-infrared person re-identification,” in European conference on computer vision, 2022, pp. 480–496

work page 2022

[25] [25]

Towards homogeneous modality learning and multi-granularity information exploration for visible-infrared person re-identification,

H. Liu, D. Xia, and W. Jiang, “Towards homogeneous modality learning and multi-granularity information exploration for visible-infrared person re-identification,”IEEE Journal of selected topics in signal processing, vol. 17, no. 3, pp. 545–559, 2023

work page 2023

[26] [26]

Fmcnet: Feature-level modality compensation for visible-infrared person re-identification,

Q. Zhang, C. Lai, J. Liu, N. Huang, and J. Han, “Fmcnet: Feature-level modality compensation for visible-infrared person re-identification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 7349–7358

work page 2022

[27] [27]

Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re- identification,

Y . Zhang and H. Wang, “Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re- identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 2153–2162

work page 2023

[28] [28]

Stochastic style perturbation modelling for visible-infrared person re-identification with severely modality imbalance,

H. Liu, Z. Li, J. Gu, M. Wang, Q. J. Wu, and W. Jiang, “Stochastic style perturbation modelling for visible-infrared person re-identification with severely modality imbalance,”Neural Networks, p. 108206, 2025

work page 2025

[29] [29]

Frequency domain nuances mining for visible-infrared person re-identification,

Y . Zhang, H. Wang, Y . Lu, Y . Yan, and X. Li, “Frequency domain nuances mining for visible-infrared person re-identification,”IEEE Transactions on Information Forensics and Security, 2025

work page 2025

[30] [30]

Discovering multi- frequency embedding for visible-infrared person re-identification,

H. Gu, X. Yang, R. Lu, L. Pu, S. Han, and M. Wu, “Discovering multi- frequency embedding for visible-infrared person re-identification,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025

[31] [31]

Wavelet-based frequency feature learning for visible-infrared person re-identification,

T. Yu, D. Cheng, H. Jiang, L. Chen, J. Qian, Q. Kou, and G. Zhai, “Wavelet-based frequency feature learning for visible-infrared person re-identification,”IEEE Transactions on Consumer Electronics, 2026

work page 2026

[32] [32]

Homogeneous-to- heterogeneous: Unsupervised learning for rgb-infrared person re- identification,

W. Liang, G. Wang, J. Lai, and X. Xie, “Homogeneous-to- heterogeneous: Unsupervised learning for rgb-infrared person re- identification,”IEEE Transactions on Image Processing, vol. 30, pp. 6392–6407, 2021

work page 2021

[33] [33]

Optimal transport for label-efficient visible-infrared person re-identification,

J. Wang, Z. Zhang, M. Chen, Y . Zhang, C. Wang, B. Sheng, Y . Qu, and Y . Xie, “Optimal transport for label-efficient visible-infrared person re-identification,” inEuropean conference on computer vision, 2022, pp. 93–109

work page 2022

[34] [34]

Unsupervised visible-infrared person re-identification via progressive graph matching and alternate learning,

Z. Wu and M. Ye, “Unsupervised visible-infrared person re-identification via progressive graph matching and alternate learning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 9548–9558

work page 2023

[35] [35]

Cross-modality hierarchical clustering and refinement for unsupervised visible-infrared person re-identification,

Z. Pang, C. Wang, L. Zhao, Y . Liu, and G. Sharma, “Cross-modality hierarchical clustering and refinement for unsupervised visible-infrared person re-identification,”IEEE Transactions on circuits and systems for video technology, vol. 34, no. 4, pp. 2706–2718, 2023

work page 2023

[36] [36]

Inter-intra modality knowledge learning and clustering noise alleviation for unsupervised visible-infrared person re-identification,

Z. Li, H. Liu, X. Peng, and W. Jiang, “Inter-intra modality knowledge learning and clustering noise alleviation for unsupervised visible-infrared person re-identification,”Transactions on Knowledge and Data Engi- neering, vol. 36, no. 8, pp. 3934–3947, 2024

work page 2024

[37] [37]

Shallow-deep collaborative learning for unsupervised visible-infrared person re-identification,

B. Yang, J. Chen, and M. Ye, “Shallow-deep collaborative learning for unsupervised visible-infrared person re-identification,” inProceedings of the IEEE/CVF international conference on computer vision, 2024, pp. 16 870–16 879

work page 2024

[38] [38]

Adaptive pseudo-label purification and debiasing for unsupervised visible-infrared person re- identification,

X. Yin, J. Shi, Z. Zhang, Y . Xie, and Y . Qu, “Adaptive pseudo-label purification and debiasing for unsupervised visible-infrared person re- identification,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025

[39] [39]

Recurrent convolu- tional network for video-based person re-identification,

N. McLaughlin, J. M. Del Rincon, and P. Miller, “Recurrent convolu- tional network for video-based person re-identification,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1325–1334

work page 2016

[40] [40]

See the forest for the trees: Joint spatial and temporal recurrent neural networks for video- based person re-identification,

Z. Zhou, Y . Huang, W. Wang, L. Wang, and T. Tan, “See the forest for the trees: Joint spatial and temporal recurrent neural networks for video- based person re-identification,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4747–4756

work page 2017

[41] [41]

Appearance- preserving 3d convolution for video-based person re-identification,

X. Gu, H. Chang, B. Ma, H. Zhang, and X. Chen, “Appearance- preserving 3d convolution for video-based person re-identification,” in European conference on computer vision. Springer, 2020, pp. 228–243

work page 2020

[42] [42]

Multi-scale 3d convolution network for video based person re-identification,

J. Li, S. Zhang, and T. Huang, “Multi-scale 3d convolution network for video based person re-identification,” inProceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8618– 8625

work page 2019

[43] [43]

Pyramid spatial-temporal aggregation for video-based person re-identification,

Y . Wang, P. Zhang, S. Gao, X. Geng, H. Lu, and D. Wang, “Pyramid spatial-temporal aggregation for video-based person re-identification,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 026–12 035

work page 2021

[44] [44]

Temporal coherence or temporal motion: Which is more critical for video-based person re-identification?

G. Chen, Y . Rao, J. Lu, and J. Zhou, “Temporal coherence or temporal motion: Which is more critical for video-based person re-identification?” inEuropean conference on computer vision. Springer, 2020, pp. 660– 676

work page 2020

[45] [45]

Temporal complemen- tary learning for video person re-identification,

R. Hou, H. Chang, B. Ma, S. Shan, and X. Chen, “Temporal complemen- tary learning for video person re-identification,” inEuropean conference on computer vision. Springer, 2020, pp. 388–405

work page 2020

[46] [46]

Spatial- temporal graph convolutional network for video-based person re- identification,

J. Yang, W.-S. Zheng, Q. Yang, Y .-C. Chen, and Q. Tian, “Spatial- temporal graph convolutional network for video-based person re- identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3289–3299

work page 2020

[47] [47]

Sta: Spatial-temporal attention for large-scale video-based person re-identification,

Y . Fu, X. Wang, Y . Wei, and T. Huang, “Sta: Spatial-temporal attention for large-scale video-based person re-identification,” inProceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8287–8294

work page 2019

[48] [48]

Multiscale aligned spatial–temporal interaction for video-based person re-identification,

Z. Ran, X. Wei, W. Liu, and X. Lu, “Multiscale aligned spatial–temporal interaction for video-based person re-identification,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 9, pp. 8536– 8546, 2024

work page 2024

[49] [49]

Spatial and temporal mutual promotion for video-based person re-identification,

Y . Liu, Z. Yuan, W. Zhou, and H. Li, “Spatial and temporal mutual promotion for video-based person re-identification,” inProceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8786–8793

work page 2019

[50] [50]

Recursively learning fine-grained spatial–temporal features for video-based person re-identification,

H. Ma, C. Zhang, Z. Li, and Z. Wang, “Recursively learning fine-grained spatial–temporal features for video-based person re-identification,”En- gineering Applications of Artificial Intelligence, vol. 148, p. 110429, 2025. 14

work page 2025

[51] [51]

Unsupervised person re-identification by deep learning tracklet association,

M. Li, X. Zhu, and S. Gong, “Unsupervised person re-identification by deep learning tracklet association,” inProceedings of the European conference on computer vision (ECCV), 2018, pp. 737–753

work page 2018

[52] [52]

Unsupervised video person re-identification via noise and hard frame aware clustering,

P. Xie, X. Xu, Z. Wang, and T. Yamasaki, “Unsupervised video person re-identification via noise and hard frame aware clustering,”arXiv preprint arXiv:2106.05441, 2021

work page arXiv 2021

[53] [53]

Successive consensus clustering for unsupervised video-based person re-identification,

J. Qian and X. Xie, “Successive consensus clustering for unsupervised video-based person re-identification,”IEEE Signal Processing Letters, vol. 29, pp. 822–826, 2022

work page 2022

[54] [54]

Sampling and re- weighting: Towards diverse frame aware unsupervised video person re- identification,

P. Xie, X. Xu, Z. Wang, and T. Yamasaki, “Sampling and re- weighting: Towards diverse frame aware unsupervised video person re- identification,”IEEE Transactions on Multimedia, vol. 24, pp. 4250– 4261, 2022

work page 2022

[55] [55]

Dual represen- tation modeling and progressive contrastive learning for unsupervised video person re-identification,

C. Zhang, Y . Su, N. Wang, Y . Lan, T. Wang, and A. Li, “Dual represen- tation modeling and progressive contrastive learning for unsupervised video person re-identification,”Neurocomputing, vol. 645, p. 130467, 2025

work page 2025

[56] [56]

Intermediary-guided bidi- rectional spatial–temporal aggregation network for video-based visible- infrared person re-identification,

H. Li, M. Liu, Z. Hu, F. Nie, and Z. Yu, “Intermediary-guided bidi- rectional spatial–temporal aggregation network for video-based visible- infrared person re-identification,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 9, pp. 4962–4972, 2023

work page 2023

[57] [57]

Fa-net: A feature alignment network for video-based visible-infrared person re- identification,

X. Yang, W. Dong, X. Wang, D. Cheng, and N. Wang, “Fa-net: A feature alignment network for video-based visible-infrared person re- identification,”IEEE Transactions on Image Processing, vol. 34, pp. 8406–8420, 2025

work page 2025

[58] [58]

Spatio-temporal information mining and fusion feature-guided modal alignment for video-based visible-infrared person re-identification,

Z. Zuo, H. Li, Y . Zhang, and M. Xie, “Spatio-temporal information mining and fusion feature-guided modal alignment for video-based visible-infrared person re-identification,”Image and Vision Computing, vol. 157, p. 105518, 2025

work page 2025

[59] [59]

Spatial-temporal high-frequency learning for video-based visible-infrared person re- identification,

S. Tao, S. Li, J. Ye, N. Dong, F. Li, and H. Li, “Spatial-temporal high-frequency learning for video-based visible-infrared person re- identification,”IEEE Transactions on Circuits and Systems for Video Technology, 2026

work page 2026

[60] [60]

Learning by aligning: Visible- infrared person re-identification using cross-modal correspondences,

H. Park, S. Lee, J. Lee, and B. Ham, “Learning by aligning: Visible- infrared person re-identification using cross-modal correspondences,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 046–12 055

work page 2021

[61] [61]

Dynamic dual-attentive aggregation learning for visible-infrared person re- identification,

M. Ye, J. Shen, D. J. Crandall, L. Shao, and J. Luo, “Dynamic dual-attentive aggregation learning for visible-infrared person re- identification,” inEuropean conference on computer vision, 2020, pp. 229–247

work page 2020

[62] [62]

Channel augmented joint learning for visible-infrared recognition,

M. Ye, W. Ruan, B. Du, and M. Z. Shou, “Channel augmented joint learning for visible-infrared recognition,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 567–13 576

work page 2021

[63] [63]

Adversarial self-attack defense and spatial-temporal relation mining for visible-infrared video person re-identification,

L. Xu, H. Li, Y . Zhang, and D. Tao, “Adversarial self-attack defense and spatial-temporal relation mining for visible-infrared video person re-identification,”International Journal of Machine Learning and Cy- bernetics, vol. 16, no. 10, pp. 7843–7858, 2025

work page 2025

[64] [64]

Shape-centered repre- sentation learning for visible–infrared person re-identification,

S. Li, J. Leng, J. Gan, M. Mo, and X. Gao, “Shape-centered repre- sentation learning for visible–infrared person re-identification,”Pattern Recognition, vol. 167, p. 111756, 2025

work page 2025

[65] [65]

Cluster contrast for unsupervised person re-identification,

Z. Dai, G. Wang, W. Yuan, S. Zhu, and P. Tan, “Cluster contrast for unsupervised person re-identification,” inProceedings of the Asian conference on computer vision, 2022, pp. 1142–1160

work page 2022

[66] [66]

Cross-camera discriminative person association by unsupervised frame clustering and selection,

Q. Li, M. Gao, G. Zhang, W. Zhai, and G. Jeon, “Cross-camera discriminative person association by unsupervised frame clustering and selection,”IEEE Internet of Things Journal, 2025

work page 2025