
arxiv: 2507.17640 · v3 · pith:EQ557KZDnew · submitted 2025-07-23 · 💻 cs.CV

Not All Starting Points Are Equal: Pre-trained Priors and Their Outsized Impact on Person Identification

Pith reviewed 2026-05-22 12:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords person re-identification · pre-trained models · foundation models · domain adaptation · fine-tuning · Bayesian priors · computer vision · transfer learning

The pith

Large pre-trained foundation models reach state-of-the-art person re-identification performance through simple fine-tuning that leaves solutions close to their initial weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the choice of starting model creates large differences in final accuracy on person re-identification benchmarks when the adaptation steps are held fixed. It treats the pre-trained weights as a prior that shapes the outcome of later training and frames the adapted solution as a high-probability point in the Gibbs posterior. Using this view, the authors obtain top results on Market, PRCC, DeepChange, and BTS by starting from models such as CLIP, Dino, EVA, and AIM and applying only modest domain adaptation. They further find that these high-performing solutions remain near the original parameter values and can be obtained with small transfer sets, though they depend strongly on optimizer choice, weight decay, and loss function.

Core claim

Under equated domain adaptation pipelines, pre-trained weights function as a strong prior; large foundation models therefore yield state-of-the-art re-identification accuracy on Market, PRCC, DeepChange, and BTS while the final weights stay close in parameter space to the starting values.

What carries the argument

Pre-trained weights acting as the prior in a maximum-probability point estimate of the Gibbs posterior under fixed domain-adaptation steps.
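
One common way to write this framing down is sketched below. It is not taken from the paper: the empirical adaptation loss $\hat{L}_{\mathcal{D}}$, the temperature $\lambda > 0$, the prior density $\pi_0$, the pre-trained weights $\theta_0$, and the Gaussian form of the prior with scale $\sigma$ are all assumptions introduced here for illustration.

```latex
% Gibbs posterior over weights \theta given adaptation data \mathcal{D} (assumed notation)
\begin{align}
\pi(\theta \mid \mathcal{D}) &\propto \exp\!\big(-\lambda\, \hat{L}_{\mathcal{D}}(\theta)\big)\, \pi_0(\theta), \\
\hat{\theta} &= \arg\max_{\theta}\, \pi(\theta \mid \mathcal{D})
             = \arg\min_{\theta} \Big[ \lambda\, \hat{L}_{\mathcal{D}}(\theta) - \log \pi_0(\theta) \Big].
\end{align}
% Illustrative assumption: an isotropic Gaussian prior centred on the pre-trained weights,
% \pi_0 = \mathcal{N}(\theta_0, \sigma^2 I), which turns the prior term into a squared distance.
\begin{equation}
\hat{\theta} = \arg\min_{\theta} \Big[ \lambda\, \hat{L}_{\mathcal{D}}(\theta)
             + \tfrac{1}{2\sigma^2}\, \lVert \theta - \theta_0 \rVert_2^2 \Big].
\end{equation}
```

Under the Gaussian assumption, a smaller $\sigma$ penalises movement away from $\theta_0$ more heavily, which is one way to read the observation that good solutions stay near the starting weights.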

If this is right

  • Large foundation models with direct fine-tuning set new performance levels on the listed re-id datasets.
  • High-performing solutions lie close in parameter space to the original pre-trained weights.
  • Comparable accuracy is reachable with small transfer sets and with different transfer datasets.
  • Results are sensitive to optimizer, weight-decay value, and loss function.
  • Direct fine-tuning of large vision foundation models should become a standard baseline in future re-id studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prior-strength argument may apply to other transfer-learning settings where adaptation data are limited.
  • Measuring Euclidean or cosine distance in weight space could serve as a cheap diagnostic for how much a given pre-training run helps a downstream task.
  • Future work could test whether deliberately moving the starting weights farther from the pre-trained point reduces final accuracy under the same adaptation budget.
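
The weight-space diagnostic suggested above could be computed along the following lines. This is a minimal sketch, not the paper's procedure: the function names are hypothetical, and weights are assumed to be available as a plain mapping from parameter name to a flat list of floats.

```python
import math


def flatten(state):
    """Flatten a {name: list of floats} mapping into one list, in sorted name order."""
    return [float(x) for name in sorted(state) for x in state[name]]


def weight_space_distance(initial, adapted):
    """Compare two weight mappings that share the same parameter names.

    Returns the Euclidean distance, that distance relative to the norm of
    the initial weights, and the cosine similarity of the two flat vectors.
    """
    if set(initial) != set(adapted):
        raise ValueError("parameter names differ between the two mappings")
    a, b = flatten(initial), flatten(adapted)
    if len(a) != len(b):
        raise ValueError("parameter counts differ between the two mappings")
    l2 = math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    dot = sum(x * y for x, y in zip(a, b))
    return {
        "l2": l2,
        "relative_l2": l2 / norm_a,
        "cosine": dot / (norm_a * norm_b),
    }
```

The relative distance is included because raw Euclidean distance grows with parameter count, so it is hard to compare across models of different sizes. With a real checkpoint the same arithmetic would run over the flattened tensors of whichever framework is in use.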

Load-bearing premise

The domain adaptation pipelines are kept identical across every starting model so that performance gaps can be attributed directly to differences in the pre-trained weights.

What would settle it

Run the identical adaptation pipeline on several foundation models and measure whether the ranking of final accuracies remains stable or collapses when the pipelines are allowed to differ.
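
Ranking stability in a test like this could be quantified with a rank correlation. The sketch below uses Kendall's tau (the tau-a variant, with no correction for ties); the function name and the input format, a mapping from model name to final accuracy for each pipeline, are assumptions made for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two {model: accuracy} mappings over the same models.

    +1.0 means both pipelines rank the models identically, -1.0 means the
    ranking is fully reversed. Tied pairs count as neither concordant nor
    discordant.
    """
    models = sorted(scores_a)
    if models != sorted(scores_b):
        raise ValueError("both pipelines must score the same set of models")
    if len(models) < 2:
        raise ValueError("at least two models are needed to compare rankings")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # A positive product means the pair is ordered the same way by both pipelines.
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs
```

A value near 1.0 across pipeline variants would suggest that the ordering of starting models is stable, while a value near zero would suggest that it depends on the recipe.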

Figures

Figures reproduced from arXiv: 2507.17640 by Alice J. O'Toole, Matthew Q. Hill, Thomas M. Metz.

Figure 1. ECHO-BID (model 10) is substantially more robust to ...
Figure 2. ECHO-BID (model 10) is substantially more robust to ...
Original abstract

Recent years have seen an explosion of diverse general purpose pre-training methodologies for computer vision. However, the impact that these pre-training methodologies have on person identification tasks (re-id) remains under-explored. We show that under equated domain adaptation pipelines, there is dramatic variance in person identification outcomes using different starting models (architectures and pre-trained weights). We show that a range of intuitive explanations for differing downstream performance on a range of re-id tests are insufficient and propose that pre-trained weights serve as a strong prior to the weights learned during domain adaptation. This framework allows for domain adapted solutions to be viewed as a maximum probability point estimate of the Gibbs posterior with the pre-trained weights acting as a prior. Under this framework, we show that large, pre-trained foundation models with simple domain adaptation achieve SOTA solutions on a range of re-id datasets (Market, PRCC, DeepChange, BTS) with solutions that are very close in the parameter space to the starting parameters. Moreover, we perform ablations on these solutions and show that they can be reached with small transfer sets and with varying transfer datasets but are sensitive to choice of optimizer, weight-decay, and loss function. Ultimately, we propose that the simple approach of direct fine-tuning using large vision foundation models (CLIP, Dino, EVA, AIM, etc.) needs to serve as an important baseline for future work in re-id.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents an empirical study on the impact of different pre-trained vision models on person re-identification (re-id) tasks. It argues that under equated domain adaptation pipelines, there is dramatic variance in performance across starting models (e.g., CLIP, DINO, EVA, AIM). Intuitive explanations for these differences are deemed insufficient, and instead, pre-trained weights are proposed to act as strong priors. This is framed using the Gibbs posterior, where domain-adapted solutions are maximum probability point estimates. The paper reports that large foundation models achieve SOTA performance on re-id datasets such as Market, PRCC, DeepChange, and BTS, with adapted parameters remaining close to the initial ones. Ablations indicate that these solutions can be reached with small transfer sets and varying datasets but are sensitive to optimizer, weight-decay, and loss function choices.

Significance. Should the results be confirmed, this paper makes a valuable contribution by highlighting the outsized influence of pre-trained priors in re-id and recommending that simple fine-tuning of large models serve as a strong baseline for future work. The Gibbs posterior framing provides an interesting interpretive tool, and the empirical demonstrations on multiple datasets with ablations add to the evidence base. This could encourage the community to focus more on initialization effects rather than solely on novel adaptation techniques.

major comments (1)
  1. [Abstract and Experimental Setup] The equivalence of the domain adaptation pipelines across different starting models is load-bearing for the central claim that performance differences are due to the pre-trained priors. The abstract states that results hold 'under equated domain adaptation pipelines' and reports sensitivity to optimizer, weight-decay, and loss function. However, it is not clear whether other key hyperparameters (learning rate schedules, epoch counts, augmentation strength) were held strictly fixed for all initializations or re-optimized per model. If a single fixed recipe was applied without per-model tuning, superior performance for certain models (e.g., CLIP vs. EVA) may reflect better alignment with that recipe rather than prior strength alone. Explicit confirmation and a table listing the shared hyperparameter values used for every starting model are required to support the attribution.
minor comments (2)
  1. The abstract refers to 'a range of intuitive explanations' being insufficient; listing the specific explanations considered (and why they fail) in the introduction or related work section would improve transparency.
  2. [Ablations] The statement that solutions are 'very close in the parameter space to the starting parameters' would be strengthened by reporting a quantitative metric such as mean L2 distance or cosine similarity between initial and final weights, ideally in a results table.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments, which help clarify the experimental details supporting our central claims. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract and Experimental Setup] The equivalence of the domain adaptation pipelines across different starting models is load-bearing for the central claim that performance differences are due to the pre-trained priors. The abstract states that results hold 'under equated domain adaptation pipelines' and reports sensitivity to optimizer, weight-decay, and loss function. However, it is not clear whether other key hyperparameters (learning rate schedules, epoch counts, augmentation strength) were held strictly fixed for all initializations or re-optimized per model. If a single fixed recipe was applied without per-model tuning, superior performance for certain models (e.g., CLIP vs. EVA) may reflect better alignment with that recipe rather than prior strength alone. Explicit confirmation and a table listing the shared hyperparameter values used for every starting model are required to support the attribution.

    Authors: We confirm that a single fixed hyperparameter recipe was used uniformly across all starting models (CLIP, DINO, EVA, AIM, etc.) with no per-model re-optimization of learning rate schedules, epoch counts, or augmentation strength. This fixed recipe was applied to isolate the effect of the pre-trained priors as the source of performance variance. The sensitivities to optimizer, weight-decay, and loss function noted in the abstract were explored in dedicated ablation studies (where those elements were varied while holding the rest of the pipeline fixed). To make the equivalence explicit, we will add a table in the revised manuscript listing all shared hyperparameter values applied to every initialization.

    Revision: yes
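
A single fixed recipe of the kind described in this exchange could be enforced in code as sketched below. The class, its field names, and the helper functions are hypothetical and are not the authors' configuration. The sketch only shows one way to make a shared recipe immutable and to check that every run uses it.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Recipe:
    """Shared adaptation settings; frozen so they cannot be altered per model."""
    optimizer: str
    learning_rate: float
    weight_decay: float
    loss: str
    epochs: int


def plan_runs(starting_models, recipe):
    """Pair every starting model with the same recipe values."""
    return [{"model": name, **asdict(recipe)} for name in starting_models]


def is_equated(runs):
    """True when all runs share identical settings, ignoring the model name."""
    settings = [{k: v for k, v in run.items() if k != "model"} for run in runs]
    return all(s == settings[0] for s in settings)
```

The run list produced by `plan_runs` could also be printed to produce the kind of shared-hyperparameter table the referee requests.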

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's central claims rest on direct empirical comparisons of performance variance and parameter-space proximity across different pre-trained initializations (CLIP, DINO, EVA, etc.) under a single fixed domain-adaptation recipe on multiple re-id benchmarks. These outcomes are measured quantities, not quantities derived from the Gibbs-posterior framing. The posterior view is explicitly offered as an interpretive lens for the observed closeness of adapted solutions to starting weights rather than a mathematical step that presupposes or constructs those measurements. No equation or claim reduces the reported SOTA results, ablation findings, or sensitivity analyses to a fitted parameter renamed as a prediction or to a self-referential definition. The derivation chain is therefore self-contained against the external experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that adaptation pipelines can be held constant; no new physical entities are introduced and no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • Domain assumption: domain adaptation pipelines can be equated across different pre-trained starting models for fair comparison.
    This premise is required to isolate the effect of pre-trained weights as the source of performance variance.



