pith. sign in

arxiv: 2504.09549 · v3 · pith:OIPX7FVGnew · submitted 2025-04-13 · 💻 cs.CV

SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Pith reviewed 2026-05-22 19:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords aerial-ground person re-identificationstable diffusiongenerative modelsview-aware featuresperson re-idview refinementcomputer vision
0
0 comments X

The pith

Fine-tuning Stable Diffusion on identity and view conditions from a ViT model generates view-mimicking features that improve aerial-ground person re-identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SD-ReID, a generative framework that trains a ViT-based extractor to capture identity and view conditions, then fine-tunes Stable Diffusion to produce features mimicking different camera perspectives while keeping identity information intact. This contrasts with prior methods that focus only on making representations invariant to viewpoint changes. A View-Refined Decoder integrates instance-level details with global features, and the combined representations are used for retrieval. Experiments across five benchmarks show gains in matching persons between aerial and ground views. If the approach holds, it offers a way to leverage generative models to handle extreme viewpoint gaps without discarding view-specific information.

Core claim

The authors claim that extracting controllable identity and view conditions via a ViT-based model, using those conditions to fine-tune Stable Diffusion for enhanced person representations, and applying a View-Refined Decoder to merge instance-level and global-level features yields improved retrieval of specific persons across aerial and ground cameras on the CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR, and G2APS-ReID datasets.

What carries the argument

The fine-tuned Stable Diffusion model guided by identity and view conditions extracted from a ViT-based model, together with the View-Refined Decoder that integrates instance-level and global-level features.

Load-bearing premise

Fine-tuning Stable Diffusion with identity and view conditions extracted by a ViT-based model produces view-mimicking features that improve rather than degrade identity discrimination, and the View-Refined Decoder integrates instance-level and global-level features without introducing new inconsistencies.

What would settle it

If adding the generated view-mimicking features and View-Refined Decoder outputs lowers retrieval accuracy on the five AG-ReID benchmarks relative to the ViT baseline alone, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2504.09549 by Huchuan Lu, Lixin Wang, Pingping Zhang, Xiang Hu, Yuhao Wang.

Figure 1
Figure 1. Figure 1: Motivations. (a) Previous AG-ReID methods focus on extracting view [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall framework of the proposed SD-ReID. In the first stage, a view-aware Transformer encoder extracts person representations ˜ [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Details of the condition learner based on aerial input. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Inference process from aerial input to ground view feature generation. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performance comparison with different numbers of identity conditions [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Detailed structures of different VRD mechanisms. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 10
Figure 10. Figure 10: Performance with different timesteps τ under the G→A protocol. Stage1 Stage2 [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of trainable parameters across existing baselines and [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison of feature distributions with t-SNE [ [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Rank list comparison among VDT, SD-ReID’s stage 1, and SD-ReID’s stage 2 on challenging examples. Green boxes indicate correct matches, while [PITH_FULL_IMAGE:figures/full_fig_p010_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Visualization of activation maps and feature similarities. (a) and [PITH_FULL_IMAGE:figures/full_fig_p010_14.png] view at source ↗
read the original abstract

Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code and pre-trained models are available at https://github.com/924973292/SD-ReID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SD-ReID, a generative framework for Aerial-Ground Person Re-Identification (AG-ReID). It first trains a ViT-based model to extract identity and view conditions from person images. These conditions then guide fine-tuning of a Stable Diffusion model to mimic feature distributions across views. A View-Refined Decoder (VRD) is introduced to integrate instance-level and global-level features. The resulting person representations and all-view features are used together for retrieval. The authors assert that this approach improves robustness to viewpoint changes and report effectiveness on five AG-ReID benchmarks (CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR, G2APS-ReID), with code and models released publicly.

Significance. If the empirical claims hold, the work has moderate significance for AG-ReID by extending conditional diffusion models to synthesize view-specific features while preserving identity discrimination, moving beyond purely discriminative view-robust designs. The public release of source code and pre-trained models at the cited GitHub repository strengthens reproducibility and allows direct verification of the pipeline.

major comments (2)
  1. [Abstract] Abstract: The central claim that the method 'demonstrate[s] the effectiveness' on five benchmarks rests on experimental outcomes, yet the manuscript text supplies no quantitative results, performance tables, ablation studies, or error analysis. Without these, the improvement over prior discriminative models cannot be assessed and is load-bearing for the contribution.
  2. [Method] Method (View-Refined Decoder description): The VRD is asserted to successfully bridge instance-level and global-level features without introducing inconsistencies, but no architecture diagram, equations for feature fusion, or training objective for the decoder are provided. This leaves the weakest assumption unverified and directly affects whether the combined representations improve rather than degrade discrimination.
minor comments (2)
  1. [Introduction] The transition in the introduction from limitations of prior work to the proposed generative approach could be tightened for clarity.
  2. Notation for the controllable conditions (identity and view) extracted by the ViT could be formalized with explicit symbols to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity and completeness while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the method 'demonstrate[s] the effectiveness' on five benchmarks rests on experimental outcomes, yet the manuscript text supplies no quantitative results, performance tables, ablation studies, or error analysis. Without these, the improvement over prior discriminative models cannot be assessed and is load-bearing for the contribution.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript contains detailed performance tables, ablation studies, and comparisons in the Experiments section. In the revision, we will update the abstract to summarize the main empirical gains (e.g., average Rank-1 improvements across the five benchmarks) so the effectiveness claim is directly supported. revision: yes

  2. Referee: [Method] Method (View-Refined Decoder description): The VRD is asserted to successfully bridge instance-level and global-level features without introducing inconsistencies, but no architecture diagram, equations for feature fusion, or training objective for the decoder are provided. This leaves the weakest assumption unverified and directly affects whether the combined representations improve rather than degrade discrimination.

    Authors: We acknowledge that the current description of the View-Refined Decoder would benefit from additional technical detail. We will add an architecture diagram, explicit equations for the instance-to-global feature fusion, and the precise training objective for the decoder in the revised Method section. This will allow readers to verify that the fusion improves rather than degrades discrimination. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines a procedural pipeline: train a ViT-based extractor for identity and view conditions, fine-tune Stable Diffusion under those conditions, insert a View-Refined Decoder, and combine instance- and global-level features for retrieval. All performance claims are obtained by standard supervised training and evaluation on five external public benchmarks (CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR, G2APS-ReID). No equation equates a claimed improvement to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via self-citation. The derivation therefore remains self-contained against external data and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on one new architectural component and standard deep-learning assumptions about diffusion-model fine-tuning; no additional free parameters or invented physical entities are introduced beyond the decoder.

free parameters (1)
  • hyperparameters for ViT and Stable Diffusion fine-tuning
    Standard training choices that are selected to optimize performance on the target benchmarks.
axioms (1)
  • domain assumption Stable Diffusion can be fine-tuned to synthesize view-specific feature distributions when conditioned on identity and view signals.
    Invoked when describing the fine-tuning stage that mimics different camera viewpoints.
invented entities (1)
  • View-Refined Decoder (VRD) no independent evidence
    purpose: Bridge instance-level and global-level features
    New module introduced to connect per-person and view-wide representations.

pith-pipeline@v0.9.0 · 5796 in / 1439 out tokens · 74387 ms · 2026-05-22T19:31:22.942046+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 7 internal anchors

  1. [1]

    Illumination-invariant person re-identification,

    Y . Huang, Z.-J. Zha, X. Fu, and W. Zhang, “Illumination-invariant person re-identification,” inACMMM, 2019, pp. 365–373

  2. [2]

    Multi-scale learning for low-resolution person re-identification,

    X. Li, W.-S. Zheng, X. Wang, T. Xiang, and S. Gong, “Multi-scale learning for low-resolution person re-identification,” inICCV, 2015, pp. 3765–3773

  3. [3]

    Adversarially occluded samples for person re-identification,

    H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang, “Adversarially occluded samples for person re-identification,” inCVPR, 2018, pp. 5098–5107

  4. [4]

    Aerial-ground person re-id,

    H. Nguyen, K. Nguyen, S. Sridharan, and C. Fookes, “Aerial-ground person re-id,” inICME, 2023, pp. 2585–2590

  5. [5]

    View-decoupled transformer for person re-identification under aerial-ground camera network,

    Q. Zhang, L. Wang, V . M. Patel, X. Xie, and J. Lai, “View-decoupled transformer for person re-identification under aerial-ground camera network,” inCVPR, 2024, pp. 22 000–22 009

  6. [6]

    Ag-reid. v2: Bridging aerial and ground views for person re-identification,

    H. Nguyen, K. Nguyen, S. Sridharan, and C. Fookes, “Ag-reid. v2: Bridging aerial and ground views for person re-identification,”TIFS, pp. 2896 – 2908, 2024

  7. [7]

    High- resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inCVPR, 2022, pp. 10 684–10 695

  8. [8]

    Diffusiondet: Diffusion model for object detection,

    S. Chen, P. Sun, Y . Song, and P. Luo, “Diffusiondet: Diffusion model for object detection,” inICCV, 2023, pp. 19 830–19 843

  9. [9]

    A generalist framework for panoptic segmentation of images and videos,

    T. Chen, L. Li, S. Saxena, G. Hinton, and D. J. Fleet, “A generalist framework for panoptic segmentation of images and videos,” inICCV, 2023, pp. 909–919

  10. [10]

    Deep metric learning for person re-identification,

    D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” inICPR, 2014, pp. 34–39

  11. [11]

    Omni-scale feature learning for person re-identification,

    K. Zhou, Y . Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature learning for person re-identification,” inICCV, 2019, pp. 3702–3712

  12. [12]

    Auto-reid: Searching for a part-aware convnet for person re-identification,

    R. Quan, X. Dong, Y . Wu, L. Zhu, and Y . Yang, “Auto-reid: Searching for a part-aware convnet for person re-identification,” inICCV, 2019, pp. 3750–3759

  13. [13]

    Transreid: Transformer-based object re-identification,

    S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, “Transreid: Transformer-based object re-identification,” inICCV, 2021, pp. 15 013– 15 022

  14. [14]

    Clip-reid: exploiting vision-language model for image re-identification without concrete text labels,

    S. Li, L. Sun, and Q. Li, “Clip-reid: exploiting vision-language model for image re-identification without concrete text labels,” inAAAI, vol. 37, no. 1, 2023, pp. 1405–1413. IEEE TRANSACTIONS ON IMAGE PROCESSING 11

  15. [15]

    Rgb-infrared cross-modality person re-identification,

    A. Wu, W.-S. Zheng, H.-X. Yu, S. Gong, and J. Lai, “Rgb-infrared cross-modality person re-identification,” inICCV, 2017, pp. 5380–5389

  16. [16]

    Hierarchical discriminative learning for visible thermal person re-identification,

    M. Ye, X. Lan, J. Li, and P. Yuen, “Hierarchical discriminative learning for visible thermal person re-identification,” inAAAI, vol. 32, no. 1, 2018

  17. [17]

    Learning progressive modality-shared transformers for effective visible-infrared person re-identification,

    H. Lu, X. Zou, and P. Zhang, “Learning progressive modality-shared transformers for effective visible-infrared person re-identification,” in AAAI, vol. 37, no. 2, 2023, pp. 1835–1843

  18. [18]

    Top-reid: Multi-spectral object re-identification with token permutation,

    Y . Wang, X. Liu, P. Zhang, H. Lu, Z. Tu, and H. Lu, “Top-reid: Multi-spectral object re-identification with token permutation,” inAAAI, vol. 38, no. 6, 2024, pp. 5758–5766

  19. [19]

    Magic tokens: Select diverse tokens for multi-modal object re-identification,

    P. Zhang, Y . Wang, Y . Liu, Z. Tu, and H. Lu, “Magic tokens: Select diverse tokens for multi-modal object re-identification,” inCVPR, 2024, pp. 17 117–17 126

  20. [20]

    Mam- bapro: Multi-modal object re-identification with mamba aggregation and synergistic prompt,

    Y . Wang, X. Liu, T. Yan, Y . Liu, A. Zheng, P. Zhang, and H. Lu, “Mam- bapro: Multi-modal object re-identification with mamba aggregation and synergistic prompt,” inAAAI, vol. 39, no. 8, 2025, pp. 8150–8158

  21. [21]

    Decoupled feature-based mixture of experts for multi-modal object re-identification,

    Y . Wang, Y . Liu, A. Zheng, and P. Zhang, “Decoupled feature-based mixture of experts for multi-modal object re-identification,” inAAAI, vol. 39, no. 8, 2025, pp. 8141–8149

  22. [22]

    Idea: Inverted text with cooper- ative deformable aggregation for multi-modal object re-identification,

    Y . Wang, Y . Lv, P. Zhang, and H. Lu, “Idea: Inverted text with cooper- ative deformable aggregation for multi-modal object re-identification,” inCVPR, 2025, pp. 29 701–29 710

  23. [23]

    Secap: Self- calibrating and adaptive prompts for cross-view person re-identification in aerial-ground networks,

    S. Wang, Y . Wang, R. Wu, B. Jiao, W. Wang, and P. Wang, “Secap: Self- calibrating and adaptive prompts for cross-view person re-identification in aerial-ground networks,” inCVPR, 2025, pp. 22 119–22 128

  24. [24]

    Cross-platform video person reid: A new benchmark dataset and adaptation approach,

    S. Zhang, W. Luo, D. Cheng, Q. Yang, L. Ran, Y . Xing, and Y . Zhang, “Cross-platform video person reid: A new benchmark dataset and adaptation approach,” inECCV, 2024, pp. 270–287

  25. [25]

    Detreidx: A stress-test dataset for real-world uav-based person recognition,

    K. A. Hambarde, N. Mbongo, P. K. MP, S. Mekewad, C. Fernandes, G. Silahtaro ˘glu, A. Nithya, P. Wasnik, M. Rashidunnabi, P. Samale et al., “Detreidx: A stress-test dataset for real-world uav-based person recognition,”arXiv preprint arXiv:2505.04793, 2025

  26. [26]

    Multi-modal multi-platform person re-identification: Benchmark and method,

    R. Ha, S. Jiang, B. Li, B. Pan, Y . Zhu, J. Zhang, X. Zhu, S. Gong, and J. Wang, “Multi-modal multi-platform person re-identification: Benchmark and method,”arXiv preprint arXiv:2503.17096, 2025

  27. [27]

    Ag-vpreid: A challenging large-scale benchmark for aerial-ground video-based person re-identification,

    H. Nguyen, K. Nguyen, A. Pemasiri, F. Liu, S. Sridharan, and C. Fookes, “Ag-vpreid: A challenging large-scale benchmark for aerial-ground video-based person re-identification,” inCVPR, 2025, pp. 1241–1251

  28. [28]

    Ag-vpreid. vir: Bridging aerial and ground platforms for video- based visible-infrared person re-id,

    H. Nguyen, K. Nguyen, A. Pemasiri, A. Jahan, C. Fookes, and S. Srid- haran, “Ag-vpreid. vir: Bridging aerial and ground platforms for video- based visible-infrared person re-id,”arXiv preprint arXiv:2507.17995, 2025

  29. [29]

    Dynamic token selective transformer for aerial-ground person re-identification,

    Y . Wang and M. Pishgar, “Dynamic token selective transformer for aerial-ground person re-identification,”arXiv preprint arXiv:2412.00433v2, 2024

  30. [30]

    Latex: Leveraging attribute- based text knowledge for aerial-ground person re-identification,

    X. Hu, Y . Wang, P. Zhang, and H. Lu, “Latex: Leveraging attribute- based text knowledge for aerial-ground person re-identification,”arXiv preprint arXiv:2503.23722, 2025

  31. [31]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020

  32. [32]

    Denoising Diffusion Implicit Models

    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020

  33. [33]

    Diffusion models beat gans on image synthesis,

    P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,”NeurIPS, vol. 34, pp. 8780–8794, 2021

  34. [34]

    Classifier-Free Diffusion Guidance

    J. Ho and T. Salimans, “Classifier-free diffusion guidance,”arXiv preprint arXiv:2207.12598, 2022

  35. [35]

    Diffusevae: Efficient, controllable and high-fidelity generation from low-dimensional latents,

    K. Pandey, A. Mukherjee, P. Rai, and A. Kumar, “Diffusevae: Efficient, controllable and high-fidelity generation from low-dimensional latents,” arXiv preprint arXiv:2201.00308, 2022

  36. [36]

    Photorealistic text-to-image diffusion models with deep language understanding,

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,”NeurIPS, vol. 35, pp. 36 479–36 494, 2022

  37. [37]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,”arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

  38. [38]

    Cascaded diffusion models for high fidelity image generation,

    J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,”JMLR, vol. 23, no. 47, pp. 1–33, 2022

  39. [39]

    Feature erasing and diffusion network for occluded person re-identification,

    Z. Wang, F. Zhu, S. Tang, R. Zhao, L. He, and J. Song, “Feature erasing and diffusion network for occluded person re-identification,” inCVPR, 2022, pp. 4754–4763

  40. [40]

    Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification

    I. H. Kim, J. Lee, W. Jin, S. Son, K. Cho, J. Seo, M.-S. Kwak, S. Cho, J. Baek, B. Leeet al., “Pose-dive: Pose-diversified augmen- tation with diffusion model for person re-identification,”arXiv preprint arXiv:2406.16042, 2024

  41. [41]

    Rethinking the inception architecture for computer vision,

    C. Szegedy, V . Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” inCVPR, 2016, pp. 2818–2826

  42. [42]

    In Defense of the Triplet Loss for Person Re-Identification

    A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,”arXiv preprint arXiv:1703.07737, 2017

  43. [43]

    Coarse-to-fine latent diffusion for pose-guided person image synthesis,

    Y . Lu, M. Zhang, A. J. Ma, X. Xie, and J. Lai, “Coarse-to-fine latent diffusion for pose-guided person image synthesis,” inCVPR, 2024, pp. 6420–6429

  44. [44]

    Fastreid: A pytorch toolbox for general instance re-identification,

    L. He, X. Liao, W. Liu, X. Liu, P. Cheng, and T. Mei, “Fastreid: A pytorch toolbox for general instance re-identification,” inACMMM, 2023, pp. 9664–9667

  45. [45]

    Learning part-based convolutional features for person re-identification,

    Y . Sun, L. Zheng, Y . Li, Y . Yang, Q. Tian, and S. Wang, “Learning part-based convolutional features for person re-identification,”TPAMI, vol. 43, no. 3, pp. 902–917, 2019

  46. [46]

    Bag of tricks and a strong baseline for deep person re-identification,

    H. Luo, Y . Gu, X. Liao, S. Lai, and W. Jiang, “Bag of tricks and a strong baseline for deep person re-identification,” inCVPR workshops, 2019, pp. 0–0

  47. [47]

    Learning discriminative features with multiple granularities for person re-identification,

    G. Wang, Y . Yuan, X. Chen, J. Li, and X. Zhou, “Learning discriminative features with multiple granularities for person re-identification,” in ACMMM, 2018, pp. 274–282

  48. [48]

    A strong and efficient baseline for vehicle re-identification using deep triplet embedding,

    R. Kumar, E. Weill, F. Aghdasi, and P. Sriram, “A strong and efficient baseline for vehicle re-identification using deep triplet embedding,” JAISCR, vol. 10, no. 1, pp. 27–45, 2020

  49. [49]

    Deep learning for person re-identification: A survey and outlook,

    M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,”TPAMI, vol. 44, no. 6, pp. 2872–2893, 2021

  50. [50]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

  51. [51]

    Learning generalisable omni-scale representations for person re-identification,

    K. Zhou, Y . Yang, A. Cavallaro, and T. Xiang, “Learning generalisable omni-scale representations for person re-identification,”TPAMI, vol. 44, no. 9, pp. 5056–5069, 2021

  52. [52]

    Unity is strength: Unifying convolutional and transformeral features for better person re- identification,

    Y . Wang, P. Zhang, X. Liu, Z. Tu, and H. Lu, “Unity is strength: Unifying convolutional and transformeral features for better person re- identification,”TITS, 2025

  53. [53]

    Prototypical contrastive learning-based clip fine- tuning for object re-identification,

    J. Li and X. Gong, “Prototypical contrastive learning-based clip fine- tuning for object re-identification,”arXiv preprint arXiv:2310.17218, 2023

  54. [54]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inICCV, 2021, pp. 10 012–10 022

  55. [55]

    Deep high-resolution representation learning for visual recognition,

    J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y . Zhao, D. Liu, Y . Mu, M. Tan, X. Wanget al., “Deep high-resolution representation learning for visual recognition,”TPAMI, vol. 43, no. 10, pp. 3349–3364, 2020

  56. [56]

    Swin transformer v2: Scaling up capacity and resolution,

    Z. Liu, H. Hu, Y . Lin, Z. Yao, Z. Xie, Y . Wei, J. Ning, Y . Cao, Z. Zhang, L. Donget al., “Swin transformer v2: Scaling up capacity and resolution,” inCVPR, 2022, pp. 12 009–12 019

  57. [57]

    Ag-reid 2023: Aerial-ground person re-identification challenge results,

    K. Nguyen, C. Fookes, S. Sridharan, F. Liu, X. Liu, A. Ross, D. Michal- ski, H. Nguyen, D. Deb, M. Kothariet al., “Ag-reid 2023: Aerial-ground person re-identification challenge results,” inIJCB, 2023, pp. 1–10

  58. [58]

    Enhancing visible- infrared person re-identification with modality-and instance-aware visual prompt learning,

    R. Wu, B. Jiao, W. Wang, M. Liu, and P. Wang, “Enhancing visible- infrared person re-identification with modality-and instance-aware visual prompt learning,” inICMR, 2024, pp. 579–588

  59. [59]

    Ground-to-aerial person search: Benchmark dataset and approach,

    S. Zhang, Q. Yang, D. Cheng, Y . Xing, G. Liang, P. Wang, and Y . Zhang, “Ground-to-aerial person search: Benchmark dataset and approach,” in ACM MM, 2023, pp. 789–799

  60. [60]

    Computational and performance aspects of pca-based face-recognition algorithms,

    H. Moon and P. J. Phillips, “Computational and performance aspects of pca-based face-recognition algorithms,”Perception, vol. 30, no. 3, pp. 303–321, 2001

  61. [61]

    Scalable person re-identification: A benchmark,

    L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” inICCV, 2015, pp. 1116–1124

  62. [62]

    Diffusers: State-of-the-art diffusion mod- els,

    P. V on Platen, S. Patil, A. Lozhkov, P. Cuenca, N. Lambert, K. Rasul, M. Davaadorj, and T. Wolf, “Diffusers: State-of-the-art diffusion mod- els,” 2022

  63. [63]

    Random erasing data augmentation,

    Z. Zhong, L. Zheng, G. Kang, S. Li, and Y . Yang, “Random erasing data augmentation,” inAAAI, vol. 34, no. 07, 2020, pp. 13 001–13 008

  64. [64]

    Large-scale machine learning with stochastic gradient de- scent,

    L. Bottou, “Large-scale machine learning with stochastic gradient de- scent,” inICCS, 2010, pp. 177–186

  65. [65]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014

  66. [66]

    Visualizing data using t-sne

    L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.”JMLR, vol. 9, no. 11, 2008