Not All Starting Points Are Equal: Pre-trained Priors and Their Outsized Impact on Person Identification

Alice J. O'Toole; Matthew Q. Hill; Thomas M. Metz

arxiv: 2507.17640 · v3 · pith:EQ557KZDnew · submitted 2025-07-23 · 💻 cs.CV


Pith reviewed 2026-05-22 12:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords: person re-identification, pre-trained models, foundation models, domain adaptation, fine-tuning, Bayesian priors, computer vision, transfer learning

The pith

Large pre-trained foundation models reach state-of-the-art person re-identification performance through simple fine-tuning that leaves solutions close to their initial weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that, when the adaptation steps are held fixed, the choice of starting model creates large differences in final accuracy on person re-identification benchmarks. It treats the pre-trained weights as a prior that shapes the outcome of later training and frames the adapted solution as a maximum-probability point estimate of the Gibbs posterior. Using this view, the authors obtain top results on Market, PRCC, DeepChange, and BTS by starting from models such as CLIP, DINO, EVA, and AIM and applying only simple domain adaptation. They further find that these high-performing solutions remain near the original parameter values and can be obtained with small transfer sets, though they are sensitive to the choice of optimizer, weight decay, and loss function.

Core claim

Under equated domain adaptation pipelines, pre-trained weights function as a strong prior; large foundation models therefore yield state-of-the-art re-identification accuracy on Market, PRCC, DeepChange, and BTS while the final weights stay close in parameter space to the starting values.

What carries the argument

Pre-trained weights acting as the prior in a maximum-probability point estimate of the Gibbs posterior under fixed domain-adaptation steps.
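
As a rough illustration of this framing (a sketch under stated assumptions, not the paper's own derivation), write w_0 for the pre-trained weights, \hat{L}_D(w) for the empirical loss on the adaptation data D, and \beta > 0 for a temperature. These symbols, and the isotropic Gaussian prior with scale \sigma in the last line, are assumptions introduced here for illustration only.

```latex
% Gibbs posterior with prior \pi_0 (generic form)
\pi(w \mid D) \;\propto\; \exp\!\big(-\beta\, \hat{L}_D(w)\big)\, \pi_0(w)

% Maximum-probability point estimate
\hat{w} \;=\; \arg\max_w \, \pi(w \mid D)
        \;=\; \arg\min_w \Big[ \beta\, \hat{L}_D(w) \;-\; \log \pi_0(w) \Big]

% Assumed example prior: \pi_0 = \mathcal{N}(w_0, \sigma^2 I)
\hat{w} \;=\; \arg\min_w \Big[ \beta\, \hat{L}_D(w) \;+\; \tfrac{1}{2\sigma^2}\, \lVert w - w_0 \rVert_2^2 \Big]
```

Under that assumed Gaussian prior, a smaller \sigma penalizes movement away from w_0 more heavily, which is one way to read the observation that good solutions stay close to the starting weights.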

If this is right

Large foundation models with direct fine-tuning set new performance levels on the listed re-id datasets.
High-performing solutions lie close in parameter space to the original pre-trained weights.
Comparable accuracy is reachable with small transfer sets and with different transfer datasets.
Results are sensitive to optimizer, weight-decay value, and loss function.
Direct fine-tuning of large vision foundation models should become a standard baseline in future re-id studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prior-strength argument may apply to other transfer-learning settings where adaptation data are limited.
Measuring Euclidean or cosine distance in weight space could serve as a cheap diagnostic for how much a given pre-training run helps a downstream task.
Future work could test whether deliberately moving the starting weights farther from the pre-trained point reduces final accuracy under the same adaptation budget.
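
The weight-space diagnostic mentioned above could be computed roughly as follows. This is a minimal sketch, assuming each checkpoint is available as a dict mapping parameter names to flat lists of floats; the function name `weight_distance` and the toy checkpoints are hypothetical and do not come from the paper.

```python
import math


def weight_distance(initial, final):
    """Return (L2 distance, L2 distance relative to the initial norm, cosine similarity).

    Both arguments are dicts mapping parameter names to flat lists of floats
    (an assumed, simplified stand-in for a real checkpoint format).
    """
    if initial.keys() != final.keys():
        raise ValueError("checkpoints have different parameter names")
    sq_diff = dot = sq_a = sq_b = 0.0
    for name in sorted(initial):
        a, b = initial[name], final[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        for x, y in zip(a, b):
            sq_diff += (x - y) ** 2
            dot += x * y
            sq_a += x * x
            sq_b += y * y
    if sq_a == 0.0 or sq_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero checkpoint")
    l2 = math.sqrt(sq_diff)
    return l2, l2 / math.sqrt(sq_a), dot / math.sqrt(sq_a * sq_b)


# Toy checkpoints (hypothetical values, for illustration only).
before = {"w": [3.0, 4.0]}
after = {"w": [6.0, 8.0]}
l2, relative_l2, cosine = weight_distance(before, after)
```

In this toy case the weights are rescaled without changing direction, so the cosine similarity stays at 1 while the L2 distance does not. That is a reason to report both quantities rather than one.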

Load-bearing premise

The domain adaptation pipelines are kept identical across every starting model so that performance gaps can be attributed directly to differences in the pre-trained weights.

What would settle it

Run the identical adaptation pipeline on several foundation models and measure whether the ranking of final accuracies remains stable or collapses when the pipelines are allowed to differ.
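
One way to quantify whether the ranking "remains stable or collapses" is a rank correlation between the per-model scores obtained under two pipeline variants. The sketch below uses Kendall's tau-a as the measure, which is a choice made here rather than something the paper specifies. The function name, model names, and scores are hypothetical placeholders, not values from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two dicts of scores keyed by model name.

    +1 means the two pipelines rank the models identically,
    -1 means the ranking is fully reversed.
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both pipelines must cover the same models")
    models = sorted(scores_a)
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up placeholder scores, NOT results from the paper.
fixed_recipe = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70}
per_model_tuned = {"model_a": 0.91, "model_b": 0.85, "model_c": 0.88}
tau = kendall_tau(fixed_recipe, per_model_tuned)
```

A tau near 1 would suggest the ordering of starting models survives the change of pipeline; a value near 0 or below would suggest the ordering depends on the recipe.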

Figures

Figures reproduced from arXiv: 2507.17640 by Alice J. O'Toole, Matthew Q. Hill, Thomas M. Metz.

**Figure 1.** ECHO-BID (model 10) is substantially more robust to ...

Original abstract

Recent years have seen an explosion of diverse general purpose pre-training methodologies for computer vision. However, the impact that these pre-training methodologies have on person identification tasks (re-id) remains under-explored. We show that under equated domain adaptation pipelines, there is dramatic variance in person identification outcomes using different starting models (architectures and pre-trained weights). We show that a range of intuitive explanations for differing downstream performance on a range of re-id tests are insufficient and propose that pre-trained weights serve as a strong prior to the weights learned during domain adaptation. This framework allows for domain adapted solutions to be viewed as a maximum probability point estimate of the Gibbs posterior with the pre-trained weights acting as a prior. Under this framework, we show that large, pre-trained foundation models with simple domain adaptation achieve SOTA solutions on a range of re-id datasets (Market, PRCC, DeepChange, BTS) with solutions that are very close in the parameter space to the starting parameters. Moreover, we perform ablations on these solutions and show that they can be reached with small transfer sets and with varying transfer datasets but are sensitive to choice of optimizer, weight-decay, and loss function. Ultimately, we propose that the simple approach of direct fine-tuning using large vision foundation models (CLIP, Dino, EVA, AIM, etc.) needs to serve as an important baseline for future work in re-id.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Different pre-trained starting points drive large gaps in re-id performance even under matched adaptation, and simple fine-tuning of big models reaches SOTA while staying close to the initial weights.

The main point is that starting model choice creates surprisingly large differences in person re-identification results, and that large foundation models with basic fine-tuning already hit strong numbers on Market, PRCC, DeepChange, and BTS while the final weights stay near the starting point. The authors treat the pre-trained weights as a prior and cast the adapted solution as a MAP estimate under a Gibbs posterior. They also report that the solutions can be reached with small transfer sets but are sensitive to the choice of optimizer, weight decay, and loss function.

This is the useful empirical core. The variance across CLIP, DINO, EVA, and similar models is the clearest new observation, and the call to treat direct fine-tuning of these models as a required baseline is a practical takeaway for the re-id community. The ablations on transfer-set size add some concrete support.

The soft spot is the claim that the domain-adaptation pipelines were fully equated. The concern is concrete: if a single fixed recipe for learning rate, epochs, and augmentations was used without per-model retuning, then models whose inductive biases happen to fit that recipe will look stronger, and the performance gap gets partly attributed to recipe compatibility rather than prior strength alone. The abstract states the pipelines were equated, but the paper would be stronger with explicit confirmation that every hyperparameter was held constant or re-optimized independently. The Gibbs framing is an interpretive lens rather than a derivation that forces the empirical result, so it does not carry heavy weight.

This work is for people doing person re-id or evaluating pre-trained vision models in applied settings. A reader who needs to choose or benchmark starting points will find the variance results and baseline recommendation directly useful.

The empirical comparisons on multiple datasets are solid enough to justify sending the paper to referees, mainly to check the pipeline-equivalence details and to see whether the SOTA numbers hold under closer scrutiny.

Referee Report

1 major / 2 minor

Summary. The manuscript presents an empirical study on the impact of different pre-trained vision models on person re-identification (re-id) tasks. It argues that under equated domain adaptation pipelines, there is dramatic variance in performance across starting models (e.g., CLIP, DINO, EVA, AIM). Intuitive explanations for these differences are deemed insufficient, and instead, pre-trained weights are proposed to act as strong priors. This is framed using the Gibbs posterior, where domain-adapted solutions are maximum probability point estimates. The paper reports that large foundation models achieve SOTA performance on re-id datasets such as Market, PRCC, DeepChange, and BTS, with adapted parameters remaining close to the initial ones. Ablations indicate that these solutions can be reached with small transfer sets and varying datasets but are sensitive to optimizer, weight-decay, and loss function choices.

Significance. Should the results be confirmed, this paper makes a valuable contribution by highlighting the outsized influence of pre-trained priors in re-id and recommending that simple fine-tuning of large models serve as a strong baseline for future work. The Gibbs posterior framing provides an interesting interpretive tool, and the empirical demonstrations on multiple datasets with ablations add to the evidence base. This could encourage the community to focus more on initialization effects rather than solely on novel adaptation techniques.

major comments (1)

[Abstract and Experimental Setup] The equivalence of the domain adaptation pipelines across different starting models is load-bearing for the central claim that performance differences are due to the pre-trained priors. The abstract states that results hold 'under equated domain adaptation pipelines' and reports sensitivity to optimizer, weight-decay, and loss function. However, it is not clear whether other key hyperparameters (learning rate schedules, epoch counts, augmentation strength) were held strictly fixed for all initializations or re-optimized per model. If a single fixed recipe was applied without per-model tuning, superior performance for certain models (e.g., CLIP vs. EVA) may reflect better alignment with that recipe rather than prior strength alone. Explicit confirmation and a table listing the shared hyperparameter values used for every starting model are required to support the attribution.

minor comments (2)

The abstract refers to 'a range of intuitive explanations' being insufficient; listing the specific explanations considered (and why they fail) in the introduction or related work section would improve transparency.
[Ablations] The statement that solutions are 'very close in the parameter space to the starting parameters' would be strengthened by reporting a quantitative metric such as mean L2 distance or cosine similarity between initial and final weights, ideally in a results table.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments, which help clarify the experimental details supporting our central claims. We address the major comment point by point below.

Referee: [Abstract and Experimental Setup] The equivalence of the domain adaptation pipelines across different starting models is load-bearing for the central claim that performance differences are due to the pre-trained priors. The abstract states that results hold 'under equated domain adaptation pipelines' and reports sensitivity to optimizer, weight-decay, and loss function. However, it is not clear whether other key hyperparameters (learning rate schedules, epoch counts, augmentation strength) were held strictly fixed for all initializations or re-optimized per model. If a single fixed recipe was applied without per-model tuning, superior performance for certain models (e.g., CLIP vs. EVA) may reflect better alignment with that recipe rather than prior strength alone. Explicit confirmation and a table listing the shared hyperparameter values used for every starting model are required to support the attribution.

Authors: We confirm that a single fixed hyperparameter recipe was used uniformly across all starting models (CLIP, DINO, EVA, AIM, etc.) with no per-model re-optimization of learning rate schedules, epoch counts, or augmentation strength. This fixed recipe was applied to isolate the effect of the pre-trained priors as the source of performance variance. The sensitivities to optimizer, weight-decay, and loss function noted in the abstract were explored in dedicated ablation studies (where those elements were varied while holding the rest of the pipeline fixed). To make the equivalence explicit, we will add a table in the revised manuscript listing all shared hyperparameter values applied to every initialization.
revision: yes
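
The protocol described in this response (one shared recipe for every starting model, with ablations that vary a single element at a time) can be made checkable in code. The following is a minimal sketch under assumptions: the `Recipe` fields, their values, and the model names are illustrative placeholders and are not the settings used in the paper.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Recipe:
    # All values below are illustrative placeholders, not the paper's settings.
    optimizer: str = "adamw"
    weight_decay: float = 0.01
    loss: str = "triplet"
    learning_rate: float = 1e-5
    epochs: int = 10


def changed_fields(base, variant):
    """Return the names of the fields where `variant` differs from `base`."""
    return [f.name for f in fields(base)
            if getattr(base, f.name) != getattr(variant, f.name)]


BASE = Recipe()

# Main comparison: every starting model is paired with the same frozen recipe.
starting_models = ["model_a", "model_b", "model_c"]  # hypothetical names
runs = {name: BASE for name in starting_models}
assert all(recipe == BASE for recipe in runs.values())

# Ablations: vary exactly one element while holding the rest fixed.
ablations = {
    "optimizer": replace(BASE, optimizer="sgd"),
    "weight_decay": replace(BASE, weight_decay=0.0),
    "loss": replace(BASE, loss="cross_entropy"),
}
for factor, variant in ablations.items():
    assert changed_fields(BASE, variant) == [factor]
```

Freezing the dataclass prevents a run from silently mutating the shared recipe, and `changed_fields` gives a direct check that each ablation differs from the base in only the intended factor.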

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central claims rest on direct empirical comparisons of performance variance and parameter-space proximity across different pre-trained initializations (CLIP, DINO, EVA, etc.) under a single fixed domain-adaptation recipe on multiple re-id benchmarks. These outcomes are measured quantities, not quantities derived from the Gibbs-posterior framing. The posterior view is explicitly offered as an interpretive lens for the observed closeness of adapted solutions to starting weights rather than a mathematical step that presupposes or constructs those measurements. No equation or claim reduces the reported SOTA results, ablation findings, or sensitivity analyses to a fitted parameter renamed as a prediction or to a self-referential definition. The derivation chain is therefore self-contained against the external experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that adaptation pipelines can be held constant; no new physical entities are introduced and no free parameters are explicitly fitted in the abstract description.

axioms (1)

domain assumption: Domain adaptation pipelines can be equated across different pre-trained starting models for fair comparison.
This premise is required to isolate the effect of pre-trained weights as the source of performance variance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "large, pre-trained foundation models with simple domain adaptation achieve SOTA solutions on a range of re-id datasets ... with solutions that are very close in the parameter space to the starting parameters"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "pre-trained weights serve as a strong prior to the weights learned during domain adaptation"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages

[1]

Foundation models defining a new era in vision: A survey and outlook

Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation models defining a new era in vision: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2245–2264,

work page
[2]

Cloth-changing person re-identification with self-attention

Vaibhav Bansal, Gian Luca Foresti, and Niki Mar- tinel. Cloth-changing person re-identification with self-attention. In 2022 IEEE/CVF Winter Confer- ence on Applications of Computer Vision Workshops (WACVW), pages 602–610, 2022. 2

work page 2022
[3]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In 2021 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR) , pages 3557–3567, 2021. 4

work page 2021
[4]

Occlude them all: Occlusion- aware attention network for occluded person re-id

Peixian Chen, Wenfeng Liu, Pingyang Dai, Jianzhuang Liu, Qixiang Ye, Mingliang Xu, Qi’an Chen, and Rongrong Ji. Occlude them all: Occlusion- aware attention network for occluded person re-id. In Proceedings of the IEEE/CVF international confer- ence on computer vision , pages 11833–11842, 2021. 3

work page 2021
[5]

Oc4-reid: Occluded cloth- changing person re-identification, 2024

Zhihao Chen, Yiyuan Ge, Ziyang Wang, Jiaju Kang, and Mingya Zhang. Oc4-reid: Occluded cloth- changing person re-identification, 2024. 8

work page 2024
[6]

Expanding accurate person recognition to new alti- tudes and ranges: The briar dataset

David Cornett, Joel Brogan, Nell Barber, Deniz Aykac, Seth Baird, Nicholas Burchfield, Carl Dukes, Andrew Duncan, Regina Ferrell, Jim Goddard, et al. Expanding accurate person recognition to new alti- tudes and ranges: The briar dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 593–602, 2023. 1, 2

work page 2023
[7]

Dauphin, Angela Fan, Michael Auli, and David Grangier

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated con- volutional networks, 2017. 3

work page 2017
[8]

An image is worth 16x16 words: Transformers for image recognition at scale, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. 2

work page 2021
[9]

Eva: Exploring the limits of masked visual representation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19358–19369, 2023. 3

work page 2023
[10]

Eva-02: A vi- sual representation for neon genesis.Image and Vision Computing, 149:105171, 2024

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A vi- sual representation for neon genesis.Image and Vision Computing, 149:105171, 2024. 1, 2, 3, 4

work page 2024
[11]

Unsupervised pre-training for person re- identification, 2021

Dengpan Fu, Dongdong Chen, Jianmin Bao, Hao Yang, Lu Yuan, Lei Zhang, Houqiang Li, and Dong Chen. Unsupervised pre-training for person re- identification, 2021. 3

work page 2021
[12]

Aonet: attentional occlusion-aware network for occluded person re-identification

Guangyu Gao, Qianxiang Wang, Jing Ge, and Yan Zhang. Aonet: attentional occlusion-aware network for occluded person re-identification. In Proceedings of the Asian conference on computer vision , pages 1606–1621, 2022. 3

work page 2022
[13]

Understanding the difficulty of training deep feedforward neural net- works

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural net- works. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics , pages 249–256, Chia Laguna Resort, Sardinia, Italy,

work page
[14]

X. Gu, H. Chang, B. Ma, S. Bai, S. Shan, and X. Chen. Clothes-changing person re-identification with rgb modality only. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 1060–1069, 2022. 2

work page 2022
[15]

Clothes-changing person re-identification with rgb modality only, 2022

Xinqian Gu, Hong Chang, Bingpeng Ma, Shutao Bai, Shiguang Shan, and Xilin Chen. Clothes-changing person re-identification with rgb modality only, 2022. 3, 5, 7

work page 2022
[16]

Dissecting the time course of person recogni- tion in natural viewing environments

Carina A Hahn, Alice J O’Toole, and P Jonathon Phillips. Dissecting the time course of person recogni- tion in natural viewing environments. British Journal of Psychology, 107(1):117–134, 2016. 1

work page 2016
[17]

Clothing-change feature augmenta- tion for person re-identification

Ke Han, Shaogang Gong, Yan Huang, Liang Wang, and Tieniu Tan. Clothing-change feature augmenta- tion for person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22066–22075, 2023. 2

work page 2023
[18]

Clip-scgi: Synthesized 9 caption-guided inversion for person re-identification,

Qianru Han, Xinwei He, Zhi Liu, Sannyuya Liu, Ying Zhang, and Jinhai Xiang. Clip-scgi: Synthesized 9 caption-guided inversion for person re-identification,

work page
[19]

Deep residual learning for image recognition,

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition,

work page
[20]

Transreid: Transformer-based ob- ject re-identification

Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. Transreid: Transformer-based ob- ject re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15013–15022, 2021. 2

work page 2021
[21]

Gaussian error lin- ear units (gelus), 2023

Dan Hendrycks and Kevin Gimpel. Gaussian error lin- ear units (gelus), 2023. 3

work page 2023
[22]

Rotary position embedding for vision trans- former, 2024

Byeongho Heo, Song Park, Dongyoon Han, and Sang- doo Yun. Rotary position embedding for vision trans- former, 2024. 3

work page 2024
[23]

Whole- body detection, identification and recognition at alti- tude and range

Siyuan Huang, Ram Prabhakar Kathirvel, Yuxiang Guo, Chun Pong Lau, and Rama Chellappa. Whole- body detection, identification and recognition at alti- tude and range. IEEE Transactions on Biometrics, Be- havior, and Identity Science, 2024. 2

work page 2024
[24]

Vills – video- image learning to learn semantics for person re- identification, 2024

Siyuan Huang, Ram Prabhakar, Yuxiang Guo, Rama Chellappa, and Cheng Peng. Vills – video- image learning to learn semantics for person re- identification, 2024. 3, 4, 5, 6, 7

work page 2024
[25]

Huang, Q

Y . Huang, Q. Wu, J. Xu, and Y . Zhong. Celebrities- reid: A benchmark for clothes variation in long-term person re-identification. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019. 4

work page 2019
[26]

Clothing status awareness for long-term person re-identification

Yan Huang, Qiang Wu, JingSong Xu, Yi Zhong, and ZhaoXiang Zhang. Clothing status awareness for long-term person re-identification. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11875–11884, 2021. 2

work page 2021
[27]

Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything,

work page
[28]

The p-destre: A fully an- notated dataset for pedestrian detection, tracking, and short/long-term re-identification from aerial devices

SV Aruna Kumar, Ehsan Yaghoubi, Abhijit Das, BS Harish, and Hugo Proenc ¸a. The p-destre: A fully an- notated dataset for pedestrian detection, tracking, and short/long-term re-identification from aerial devices. IEEE Transactions on Information Forensics and Se- curity, 16:1696–1708, 2020. 2

work page 2020
[29]

The open images dataset v4.International Journal of Com- puter Vision, 128(7):1956–1981, 2020

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4.International Journal of Com- puter Vision, 128(7):1956–1981, 2020. 4

work page 1956
[30]

Attribute de-biased vision transformer (ad-vit) for long-term person re-identification

Kyung Won Lee, Bhavin Jawade, Deen Mohan, Sri- rangaraj Setlur, and Venu Govindaraju. Attribute de-biased vision transformer (ad-vit) for long-term person re-identification. In 2022 18th IEEE Inter- national Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–8, 2022. 2

work page 2022
[31]

Clip-reid: Exploit- ing vision-language model for image re-identification without concrete text labels, 2023

Siyuan Li, Li Sun, and Qingli Li. Clip-reid: Exploit- ing vision-language model for image re-identification without concrete text labels, 2023. 3

work page 2023
[32]

Clip-driven cloth- agnostic feature learning for cloth-changing person re- identification, 2024

Shuang Li, Jiaxu Leng, Guozhang Li, Ji Gan, Haosheng chen, and Xinbo Gao. Clip-driven cloth- agnostic feature learning for cloth-changing person re- identification, 2024. 3

work page 2024
[33]

Uav-human: A large benchmark for human behavior understanding with unmanned aerial vehicles

Tianjiao Li, Jun Liu, Wei Zhang, Yun Ni, Wen- qian Wang, and Zhiheng Li. Uav-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In 2021 IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 16261–16270, 2021. 2

work page 2021
[34]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C. Lawrence Zitnick. Microsoft coco: Common ob- jects in context. In Computer Vision – ECCV 2014 , pages 740–755, Cham, 2014. Springer International Publishing. 4

work page 2014
[35]

Distilling clip with dual guidance for learning discriminative human body shape representation

Feng Liu, Minchul Kim, Zhiyuan Ren, and Xiaoming Liu. Distilling clip with dual guidance for learning discriminative human body shape representation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 256–266, 2024. 3

work page 2024
[36]

Swin transformer: Hierarchical vision transformer using shifted windows, 2021

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021. 1, 2

work page 2021
[37]

Self- supervised pre-training for transformer-based person re-identification, 2021

Hao Luo, Pichao Wang, Yi Xu, Feng Ding, Yanxin Zhou, Fan Wang, Hao Li, and Rong Jin. Self- supervised pre-training for transformer-based person re-identification, 2021. 2

work page 2021
[38]

Subject identification up to 1km: Performer perspective on the iarpa briar program

Scott McCloskey, Brandon RichardWebster, Roddy Collins, and Anthony Hoogs. Subject identification up to 1km: Performer perspective on the iarpa briar program. Proceedings of the National Security Sensor and Data Fusion Committee (NSSDF), 2023. 2

work page 2023
[39]

Dissecting human body representations in deep networks trained for person identification, 2025

Thomas M Metz, Matthew Q Hill, Blake Myers, Veda Nandan Gandi, Rahul Chilakapati, and Alice J O’Toole. Dissecting human body representations in deep networks trained for person identification, 2025. 2, 3, 8

work page 2025
[40]

Myers, Lucas Jaggernauth, Thomas M

Blake A. Myers, Lucas Jaggernauth, Thomas M. Metz, Matthew Q. Hill, Veda Nandan Gandi, Car- los D. Castillo, and Alice J. O’Toole. Recognizing people by body shape using deep networks of images and words. Proceedings of the IEEE: International Joint Conference on Biometrics, 2023. 2 10

work page 2023
[41]

Unconstrained body recognition at altitude and range: Comparing four approaches, 2025

Blake A Myers, Matthew Q Hill, Veda Nandan Gandi, Thomas M Metz, and Alice J O’Toole. Unconstrained body recognition at altitude and range: Comparing four approaches, 2025. 1, 2, 3, 4, 6, 7

work page 2025
[42]

Masked attribute description embedding for cloth-changing person re- identification, 2024

Chunlei Peng, Boyu Wang, Decheng Liu, Nannan Wang, Ruimin Hu, and Xinbo Gao. Masked attribute description embedding for cloth-changing person re- identification, 2024. 4

work page 2024
[43]

Long-term cloth-changing person re- identification, 2020

Xuelin Qian, Wenxuan Wang, Li Zhang, Fangrui Zhu, Yanwei Fu, Tao Xiang, Yu-Gang Jiang, and Xi- angyang Xue. Long-term cloth-changing person re- identification, 2020. 2

work page 2020
[44]

Learning trans- ferable visual models from natural language supervi- sion, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning trans- ferable visual models from natural language supervi- sion, 2021. 3

work page 2021
[45]

Prajit Ramachandran, Barret Zoph, and Quoc V . Le. Searching for activation functions, 2017. 3

work page 2017
[46]

Imagenet-21k pretraining for the masses, 2021

Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses, 2021. 4

work page 2021
[47]

Imagenet large scale visual recognition chal- lenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition chal- lenge. International journal of computer vision , 115: 211–252, 2015. 2, 4

work page 2015
[48]

Ob- jects365: A large-scale, high-quality dataset for object detection

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Ob- jects365: A large-scale, high-quality dataset for object detection. In 2019 IEEE/CVF International Confer- ence on Computer Vision (ICCV) , pages 8429–8438,

work page 2019
[49]

Kapil, and David Chap- man

Charu Sharma, Siddhant R. Kapil, and David Chap- man. Person re-identification with a locally aware transformer, 2021. 2

work page 2021
[50]

Conceptual captions: A cleaned, hy- pernymed, image alt-text dataset for automatic im- age captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hy- pernymed, image alt-text dataset for automatic im- age captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers) , pages 2556–2565, Melbourne, Australia, 2018. Association for Compu- tatio...

work page 2018
[51]

Glu variants improve transformer,

Noam Shazeer. Glu variants improve transformer,

work page
[52]

X. Shu, X. Wang, X. Zang, S. Zhang, Y . Chen, G. Li, and Q. Tian. Large-scale spatio-temporal person re-identification: Algorithms and benchmark. IEEE Transactions on Circuits and Systems for Video Tech- nology, 32(7):4390–4403, 2021. 4

work page 2021
[53]

Body part-based representation learning for occluded person re-identification

Vladimir Somers, Christophe De Vleeschouwer, and Alexandre Alahi. Body part-based representation learning for occluded person re-identification. In Pro- ceedings of the IEEE/CVF winter conference on appli- cations of computer vision, pages 1613–1623, 2023. 3

[54] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding, 2023.

[55] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale, 2023.

[56] Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, and Furu Wei. Foundation transformers, 2022.

[57] Kai Wang, Zhi Ma, Shiyan Chen, Jinni Yang, Keke Zhou, and Tao Li. A benchmark for clothes variation in person re-identification. International Journal of Intelligent Systems, 35(12):1881–1898, 2020.

[58] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.

[59] Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, and Yue Cao. Revealing the dark secrets of masked image modeling, 2022.

[60] Peng Xu and Xiatian Zhu. DeepChange: A large long-term person re-identification benchmark with clothes change, 2022.

[61] Peng Xu and Xiatian Zhu. DeepChange: A long-term person re-identification benchmark with clothes change. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11196–11205, 2023.

[62] Cheng Yan, Guansong Pang, Jile Jiao, Xiao Bai, Xuetao Feng, and Chunhua Shen. Occluded person re-identification with single-scale global representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11875–11884, 2021.

[63] Qize Yang, Ancong Wu, and Wei-Shi Zheng. Person re-identification by contour sketch under moderate clothing change. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[64] Zhengwei Yang, Meng Lin, Xian Zhong, Yu Wu, and Zheng Wang. Good is bad: Causality inspired cloth-debiasing for cloth-changing person re-identification. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1472–1481, 2023.

[65] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):2872–2893, 2022.

[66] Shijie Yu, Shihua Li, Dapeng Chen, Rui Zhao, Junjie Yan, and Yu Qiao. COCAS: A large-scale clothes changing person dataset for re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3400–3409, 2020.

[67] Guowen Zhang, Pingping Zhang, Jinqing Qi, and Huchuan Lu. HAT: Hierarchical aggregation transformers for person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, pages 516–525, New York, NY, USA, 2021. Association for Computing Machinery.

[68] Yi Zhang, Pengliang Ji, Angtian Wang, Jieru Mei, Adam Kortylewski, and Alan Yuille. 3D-aware neural body fitting for occlusion robust 3D human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9399–9410, 2023.

[69] Huazhong Zhao, Lei Qi, and Xin Geng. CLIP-FGDI: Exploiting vision-language model for generalizable person re-identification, 2025.

[70] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1116–1124, 2015.

[71] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116–1124, 2015.

[72] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. MARS: A video benchmark for large-scale person re-identification. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14, pages 868–884. Springer, 2016.

[73] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019.

[74] Haidong Zhu, Wanrong Zheng, Zhaoheng Zheng, and Ram Nevatia. ShARc: Shape and appearance recognition for person identification in-the-wild. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6290–6300, 2024.

[75] Jiaxuan Zhuo, Zeyu Chen, Jianhuang Lai, and Guangcong Wang. Occluded person re-identification. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2018.
