Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

Karina Kvanchiani; Timur Mamedov

arXiv: 2606.02242 · v1 · pith:KJJTG3OEnew · submitted 2026-06-01 · cs.CV · cs.AI · cs.LG

Pith reviewed 2026-06-28 15:33 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.LG

keywords: person re-identification, image-to-image retrieval, text-to-image retrieval, cross-modal learning, decoupled training, vision encoder, optimization conflicts


The pith

A decoupled two-stage pipeline with one vision encoder trains image-based and text-based person re-identification without cross-task interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that joint optimization of I2I and T2I person re-identification creates suboptimal shared representations because of modality gaps and opposing loss objectives. It shows that a two-stage process—first training the vision encoder on I2I data, then adding textual supervision—lets the same encoder handle both retrieval modes. Experiments across mixing strategies and objectives indicate I2I pre-training improves T2I generalization and text supervision boosts performance on both tasks. A sympathetic reader would care because this offers a practical route to unified cross-modal ReID systems that avoid the interference seen in simultaneous training.

Core claim

The central claim is that modality discrepancies and conflicting objectives hinder joint I2I-T2I training, and that a decoupled two-stage pipeline built on a single vision encoder supports both retrieval settings while avoiding interference; I2I pre-training aids T2I generalization and textual supervision during encoder training improves results on both.

What carries the argument

The decoupled two-stage training pipeline based on a single vision encoder that separates I2I pre-training from later text supervision.
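
As a rough illustration of what "decoupled" means at the level of a training schedule, the sketch below lists which objective is active at each step. It is a minimal sketch under stated assumptions, not the authors' implementation: the objective labels `I2I` and `T2I`, the function names, and the choice to run only the text objective in stage 2 are all hypothetical, since this summary does not name the paper's losses or step counts.

```python
from typing import List, Tuple

# Hypothetical labels; the paper's actual loss functions are not named here.
I2I = "i2i_identity_loss"
T2I = "t2i_text_image_loss"

Schedule = List[Tuple[str, ...]]


def decoupled_schedule(stage1_steps: int, stage2_steps: int) -> Schedule:
    """Stage 1 applies only the I2I objective; stage 2 applies only the
    text-supervised objective (an assumption made for this sketch)."""
    return [(I2I,)] * stage1_steps + [(T2I,)] * stage2_steps


def joint_schedule(total_steps: int) -> Schedule:
    """Joint training applies both objectives at every step."""
    return [(I2I, T2I)] * total_steps


def overlapping_steps(schedule: Schedule) -> int:
    """Count steps where both objectives update the encoder together."""
    return sum(1 for active in schedule if I2I in active and T2I in active)
```

Under these assumptions, a decoupled schedule never applies both objectives in the same step, while a joint schedule does so at every step. Whether that separation is what produces the reported gains is the open question raised later on this page.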

If this is right

I2I ReID pre-training improves generalization to T2I retrieval data.
Adding textual supervision while training the vision encoder raises accuracy for both I2I and T2I tasks.
The two-stage pipeline prevents the negative transfer that occurs when I2I and T2I losses are optimized together.
Varying domain mixing, learning strategies, and task objectives confirms the pipeline works across multiple configurations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged separation could be tested on other cross-modal retrieval problems where identity-level and instance-level objectives compete.
If the pattern holds, unified ReID systems might routinely adopt pre-training on the easier modality before introducing the harder one.
The findings imply that future encoder designs should expose separate optimization phases rather than relying on a single joint loss.

Load-bearing premise

Modality discrepancies and conflicting objectives are the primary causes of suboptimal shared representations, and separating the training stages resolves them without losing benefits that simultaneous optimization might provide.
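
One way to probe this premise is a diagnostic that is common in multi-task learning but is not described in this summary of the paper: compare the gradients that the two objectives produce on the shared encoder. A negative cosine similarity means the two updates pull the parameters in opposing directions. The sketch below assumes the gradients have already been flattened into plain lists of floats; the function names are hypothetical.

```python
import math
from typing import Sequence


def gradient_cosine(g_a: Sequence[float], g_b: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    if len(g_a) != len(g_b):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero gradient has no direction, so report no alignment
    return dot / (norm_a * norm_b)


def is_conflicting(g_a: Sequence[float], g_b: Sequence[float]) -> bool:
    """True when the two gradients point in opposing directions."""
    return gradient_cosine(g_a, g_b) < 0.0
```

Tracking this value over joint training would show whether the I2I and T2I objectives actually oppose each other on the shared encoder, which is the part of the premise that the reported experiments leave untested.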

What would settle it

A controlled experiment in which simultaneous joint optimization of I2I and T2I objectives on the same encoder yields equal or higher accuracy on both retrieval tasks than the proposed two-stage pipeline.
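
The decision rule above can be written down explicitly. The sketch below is an illustration only: `RetrievalScores`, `joint_matches_or_beats`, and the `tolerance` parameter are hypothetical names, and the scores stand for any higher-is-better retrieval metric, since this summary does not say which metrics the paper reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalScores:
    """One higher-is-better score per retrieval task (metric unspecified)."""
    i2i: float
    t2i: float


def joint_matches_or_beats(
    joint: RetrievalScores,
    two_stage: RetrievalScores,
    tolerance: float = 0.0,
) -> bool:
    """True when joint training is at least as good on BOTH tasks.

    `tolerance` is an assumed allowance for run-to-run noise.
    """
    return (
        joint.i2i >= two_stage.i2i - tolerance
        and joint.t2i >= two_stage.t2i - tolerance
    )
```

A `True` result from a matched run (same encoder, same data, same compute budget) would count against the interference explanation. A `False` result would be consistent with it, without ruling out other staging effects.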

Figures

Figures reproduced from arXiv: 2606.02242 by Karina Kvanchiani, Timur Mamedov.

**Figure 2.** Impact of text- and image-based domain data incorporation during vision encoder pre-training. Bars on the left graph represent …

Original abstract

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental difference between two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observed that I2I ReID pre-training positively impacts the generalization ability to T2I data. Besides, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The decoupled pipeline claim isn't backed by a joint-training baseline, so the interference story remains unproven.


The key takeaway is that this paper's decoupled two-stage training for image and text person ReID rests on an assumption that isn't directly tested. They argue that joint optimization suffers from modality discrepancies and conflicting objectives, but they don't run the joint case with the same single vision encoder to show the interference.

What they do report is that pre-training the vision encoder on I2I ReID improves performance on T2I, and that adding textual supervision during that stage helps both tasks. The abstract frames this as evidence that the two-stage approach avoids cross-task interference. If the experiments are clean, that could be a useful practical tip for people building multi-modal ReID systems.

The soft spot is exactly the one in the stress-test note. Without a controlled joint-optimization run on the identical encoder and loss combination, it's hard to attribute the gains to removal of interference rather than just the benefits of staged training or different hyper-parameters. The abstract mentions varying domain mixing, learning strategies, and task objectives, but skips the direct comparator. That makes the central claim weaker than it could be.

The paper also doesn't give any concrete numbers, dataset names, or baseline comparisons in the abstract, which is odd for claiming performance improvements. This makes it difficult to assess how substantial the findings are.

This kind of work is aimed at the person ReID community, particularly those interested in cross-modal retrieval. A reader working on similar multi-task vision-language setups might find the observations on pre-training effects worth checking, but only if the full paper fills in the experimental details and adds the missing baseline.

I'd recommend against sending it to peer review in its current form because the main claim needs that control experiment to hold up. If they add it and the results still favor the two-stage approach, then it could be worth a review.

Referee Report

2 major / 0 minor

Summary. The paper claims that joint optimization of image-based (I2I) and text-based (T2I) person re-identification is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. It proposes a decoupled two-stage training pipeline using a single vision encoder to support both I2I and T2I retrieval while avoiding cross-task interference. Experiments across multiple configurations (varying domain mixing, learning strategies, and task objectives) show that I2I ReID pre-training improves generalization to T2I data and that textual supervision during vision encoder training enhances performance on both tasks.

Significance. If the central claims hold after addressing the experimental gaps, the work would provide useful empirical insights into training unified ReID systems by separating optimization stages, with the specific observations on I2I pre-training benefits and textual supervision effects offering practical guidance for cross-modal retrieval. The paper receives credit for exploring the fundamental differences between the two ReID tasks and for conducting experiments that vary multiple training factors.

major comments (2)

[Experiments (as described)] The central claim that the decoupled two-stage pipeline resolves optimization conflicts by avoiding cross-task interference requires a direct comparator, but no joint-optimization baseline using the identical single vision encoder and the same I2I + T2I loss combination is reported. Without this controlled run, gains from I2I pre-training or textual supervision cannot be confidently attributed to removal of interference rather than staged optimization dynamics, data ordering, or hyper-parameter effects.
[Abstract] The abstract states experimental observations and performance improvements but provides no details on datasets, metrics, baselines, error bars, or exclusion criteria. This omission prevents assessment of whether the data supports the claims as stated.
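
For the error bars requested in the second comment, a conventional choice is the mean and sample standard deviation over repeated runs with different random seeds. The sketch below is a minimal illustration using Python's standard `statistics` module; the function name `summarize_runs` is hypothetical, and nothing here reflects how the authors actually report variance.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting both numbers for each configuration, including any joint-training baseline, would let a reader judge whether a gap between configurations exceeds seed-to-seed variation.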

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results. We respond to each major comment below.


Referee: [Experiments (as described)] The central claim that the decoupled two-stage pipeline resolves optimization conflicts by avoiding cross-task interference requires a direct comparator, but no joint-optimization baseline using the identical single vision encoder and the same I2I + T2I loss combination is reported. Without this controlled run, gains from I2I pre-training or textual supervision cannot be confidently attributed to removal of interference rather than staged optimization dynamics, data ordering, or hyper-parameter effects.

Authors: We agree that a direct joint-optimization baseline with the identical single vision encoder and the combined I2I + T2I loss would provide a stronger control experiment. Our reported results vary domain mixing, learning strategies, and task objectives, but do not include this exact joint-training comparator. We will add the requested baseline in the revision to better isolate the effect of decoupling. revision: yes

Referee: [Abstract] The abstract states experimental observations and performance improvements but provides no details on datasets, metrics, baselines, error bars, or exclusion criteria. This omission prevents assessment of whether the data supports the claims as stated.

Authors: We will revise the abstract to include the primary datasets, metrics, key baselines, and a brief note on error bars or statistical reporting while preserving conciseness. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical claims without derivations or self-referential reductions


The paper contains no equations, loss derivations, fitted parameters presented as predictions, or uniqueness theorems. All central claims (decoupled two-stage pipeline benefits, I2I pre-training impact, textual supervision gains) rest on experimental observations across configurations. No self-citation chains or ansatzes are invoked to justify the method; the pipeline is introduced as a proposal and evaluated directly. The absence of a joint-optimization baseline is a methodological gap but does not constitute circularity, as no derivation reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical content, free parameters, axioms, or invented entities are introduced in the abstract; the work is an empirical proposal for a training strategy.



