arxiv: 2503.22171 · v2 · submitted 2025-03-28 · 💻 cs.CV

An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval

Pith reviewed 2026-05-22 22:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords synthetic data · text-based person retrieval · data synthesis pipeline · image generation · text generation · privacy preservation · data augmentation · empirical study

The pith

A fully synthetic data pipeline can serve as a standalone replacement for, or an augmentation to, real data when training text-based person retrieval models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a data synthesis method that creates person images and matching text descriptions without using any real person images or manual labels. It generates diverse identities through automatic prompt-based image creation and adds within-identity variations via text-driven editing, then produces corresponding text descriptions automatically. Extensive experiments test this synthetic data across multiple scenarios to measure how well it trains retrieval models that are evaluated on real test sets. The findings establish that the synthetic data can train competitive models on its own or improve performance when mixed with real data. This approach addresses the privacy and labeling costs that limit real-data collection for such systems.

Core claim

The paper claims that a unified synthesis pipeline operating entirely without real person data produces training examples whose practical utility is demonstrated through experiments: an inter-class module creates diverse identity-centric images via automatic prompt construction, an intra-class module increases identity variation via text-driven image editing, and automatic text generation supplies the paired descriptions, allowing the resulting data to function either as a complete replacement for real data or as a complementary augmentation.

What carries the argument

The unified data synthesis pipeline that combines inter-class image generation via automatic prompt construction with intra-class augmentation via text-driven image editing, plus automatic textual description generation.
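
The prompt-construction and editing steps described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the descriptor lists, the template, the edit instructions, and the function names are all hypothetical and are not taken from the paper, and the actual image generator and editor are left out entirely.

```python
import random

# Hypothetical descriptor lists, invented for illustration only.
DESCRIPTORS = {
    "age": ["young", "middle-aged", "elderly"],
    "gender": ["man", "woman"],
    "top": ["red jacket", "white t-shirt", "grey hoodie"],
    "bottom": ["blue jeans", "black trousers", "khaki shorts"],
    "scene": ["on a street", "in a shopping mall", "at a train station"],
}

# Hypothetical prompt template with one slot per descriptor list.
TEMPLATE = "full-body photo, {age} {gender}, wearing a {top} and {bottom}, {scene}"

# Hypothetical edit instructions for within-identity variation.
EDITS = [
    "turn the person to face away from the camera",
    "change the lighting to dusk",
    "move the person to a crowded platform",
    "make the person carry a backpack",
]


def build_identity_prompt(rng):
    """Sample one descriptor per slot and fill the template (one new identity)."""
    choice = {slot: rng.choice(options) for slot, options in DESCRIPTORS.items()}
    return TEMPLATE.format(**choice)


def build_edit_instructions(rng, n):
    """Pick n distinct edit instructions to vary a single identity."""
    return rng.sample(EDITS, n)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
prompt = build_identity_prompt(rng)
edits = build_edit_instructions(rng, 2)
print(prompt)
print(edits)
```

In a real pipeline the prompt would go to a text-to-image model and each edit instruction to a text-driven image editor; both are treated here as out of scope.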

If this is right

  • Models trained solely on the synthetic data achieve competitive retrieval accuracy on real test sets.
  • Mixing the synthetic data with real data produces higher performance than real data alone in the tested scenarios.
  • The method removes the requirement to collect real person images or obtain manual textual annotations.
  • The pipeline supports systematic testing of synthetic data effectiveness across a range of real-world retrieval conditions.
  • Automatic generation enables production of arbitrarily large training sets without additional human effort.
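
The two usage modes in the list above (synthetic only, or synthetic mixed with real) can be expressed as one small helper. This is a minimal sketch, not the paper's training setup: the function name, the `(image_path, caption)` pair layout, and the file names are assumptions made for illustration.

```python
import random


def mix_training_pairs(real_pairs, synthetic_pairs, n_synthetic, seed=0):
    """Return all real pairs plus a random subset of synthetic pairs, shuffled.

    Passing an empty real_pairs gives the synthetic-only setting;
    n_synthetic=0 gives the real-only baseline.
    """
    synthetic_pairs = list(synthetic_pairs)
    if n_synthetic > len(synthetic_pairs):
        raise ValueError("n_synthetic exceeds the number of synthetic pairs")
    rng = random.Random(seed)
    mixed = list(real_pairs) + rng.sample(synthetic_pairs, n_synthetic)
    rng.shuffle(mixed)
    return mixed


# Hypothetical (image_path, caption) pairs.
real = [("real_%d.jpg" % i, "caption") for i in range(3)]
syn = [("syn_%d.png" % i, "caption") for i in range(5)]
mixed = mix_training_pairs(real, syn, n_synthetic=2)
print(len(mixed))
```

Sweeping `n_synthetic` while holding the real set fixed is one simple way to produce the kind of performance-versus-data-size curves that the figure captions mention.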

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generation approach could be adapted to reduce data collection needs in other vision-language retrieval tasks.
  • If the synthetic data preserves or increases diversity, it might help address dataset biases that appear in real collections.
  • Scaling the pipeline with different image generators could further improve the quality gap to real data.
  • Widespread adoption would lower the infrastructure cost of deploying text-based person retrieval systems.

Load-bearing premise

The generated synthetic images and descriptions are realistic and diverse enough that models trained on them achieve performance on real test data that reflects actual usefulness.

What would settle it

If models trained only on the synthetic data achieve substantially lower accuracy than real-data baselines on standard real-world benchmarks such as CUHK-PEDES, the claim of practical utility as a replacement would be falsified.
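
Such a comparison needs a concrete accuracy measure. The sketch below computes Rank-k accuracy, a metric commonly reported for text-to-image person retrieval; that the paper reports exactly this metric is an assumption, and the function name and toy similarity matrix are hypothetical.

```python
def rank_k_accuracy(similarity, query_ids, gallery_ids, k=1):
    """Fraction of text queries whose top-k gallery images contain the right identity.

    similarity[i][j] is the score between text query i and gallery image j.
    """
    hits = 0
    for row, qid in zip(similarity, query_ids):
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(gallery_ids)), key=lambda j: row[j], reverse=True)
        if any(gallery_ids[j] == qid for j in ranked[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy example: 3 text queries against 3 gallery images, one image per identity.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.6, 0.5, 0.1],
]
query_ids = [0, 1, 2]
gallery_ids = [0, 1, 2]
print(rank_k_accuracy(similarity, query_ids, gallery_ids, k=1))
```

Running the same function on a model trained with real data and on one trained with synthetic data only, over the same real test set, gives the pair of numbers the falsification test compares.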

Figures

Figures reproduced from arXiv: 2503.22171 by Dong Yi, Jinqiao Wang, Mang Ye, Min Cao, Yuxin Lu, Ziyin Zeng.

Figure 1: Data production paradigms for TBPR model training. (a) …
Figure 2: Workflow of our framework for validating synthetic data for TBPR. It involves the following steps. (1) Inter-class image genera…
Figure 3: Performance trend under different value of the guidance …
Figure 4: Illustration of real data (a) and synthetic data (b)
Figure 5: Descriptor lists used in the rough description templates for inter-class image generation.
Figure 6: Some templates used in generating the textual description of synthetic image.
Figure 7: Performance trend under different number of synthetic …
Figure 8: Performance trend under different number of synthetic data and different number of real data on CUHK-PEDES.
Figure 9: Illustration of noisy images. Noise typically arises from …
Figure 10: Illustration of noisy texts. Noise usually manifests as inappropriate symbols or irrelevant information within the text.
Figure 11: Illustration of synthetic data. The images are generated from the proposed inter-class image generation pipeline under scenario …
Figure 12: Illustration of editing images
Figure 13: Illustration of real data (CUHK-PEDES) and synthetic data. Synthetic data is shown from other method [ …

Original abstract

Data plays a pivotal role in Text-Based Person Retrieval (TBPR) research. The mainstream research paradigm requires real-world person images with manual textual annotations for training models, which raises privacy concerns and imposes annotation burdens. Several pioneering efforts explore synthetic data generation, yet they still depend on real data as a foundation and so inherit the same limitations. The feasibility of purely synthetic TBPR data remains unexplored, and there is currently no systematic study of the effectiveness boundaries of synthetic data across various real-world scenarios. In this work, we present the first comprehensive empirical study of synthetic data for TBPR, with two key aspects. (1) We propose a unified data synthesis pipeline that can operate entirely without real person data. It combines an inter-class image generation module that produces diverse identity-centric images by means of an automatic prompt construction strategy, and an intra-class augmentation module that enhances identity variation through text-driven image editing. (2) Leveraging this pipeline and automatic textual description generation, we explore the effectiveness of synthetic data in diverse scenarios through extensive experiments to reveal its practical utility as either a standalone replacement for or a complementary augmentation to real data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first comprehensive empirical study of synthetic data for Text-Based Person Retrieval (TBPR). It proposes a unified synthesis pipeline that generates both images and descriptions entirely without real person data, via an inter-class module using automatic prompt construction for diverse identities and an intra-class module using text-driven editing for variation. Automatic textual description generation is also included. Extensive experiments across scenarios are used to assess whether synthetic data can serve as a standalone replacement or complementary augmentation to real data.

Significance. If the results hold, the work could meaningfully reduce privacy and annotation burdens in TBPR by establishing empirical boundaries for synthetic-data utility. Credit is given for conducting the first systematic study of a fully synthetic pipeline and for running experiments across multiple scenarios rather than a single setting.

major comments (2)
  1. [§3 (unified data synthesis pipeline)] The central claim that the pipeline produces data 'sufficiently realistic and diverse' for practical utility rests solely on downstream retrieval metrics on real test sets. No independent validation of the synthetic distribution (FID, attribute-level fidelity, or human realism ratings) is described for the inter-class generation or intra-class augmentation modules.
  2. [§4 (experiments)] Because the generative models are pre-trained on real distributions, end-task performance alone cannot isolate whether success stems from the proposed automatic prompt and editing strategy or from the base generators. An ablation or control (e.g., random prompts or non-person-specific editing) is needed to support the 'practical utility' conclusion.
minor comments (2)
  1. [Abstract] The abstract states that experiments explore 'diverse scenarios' but does not enumerate them; a brief list would improve clarity.
  2. [§3.1] Notation for the automatic prompt construction strategy and text-driven editing operations should be introduced with explicit symbols or pseudocode for reproducibility.
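
For reference, the FID check named in the first major comment compares Gaussian fits to feature embeddings of real and synthetic images. The standard definition is given below; the symbols are chosen here for illustration, with $(\mu_r, \Sigma_r)$ and $(\mu_s, \Sigma_s)$ the feature mean and covariance of the real and synthetic sets respectively. Which feature extractor and which real reference set would be appropriate for person images is left open.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2\,(\Sigma_r \Sigma_s)^{1/2} \right)
```

Lower values indicate that the two feature distributions are closer; a value of zero requires identical means and covariances.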

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, agreeing where the manuscript requires strengthening and outlining the planned changes.

Point-by-point responses
  1. Referee: [§3 (unified data synthesis pipeline)] The central claim that the pipeline produces data 'sufficiently realistic and diverse' for practical utility rests solely on downstream retrieval metrics on real test sets. No independent validation of the synthetic distribution (FID, attribute-level fidelity, or human realism ratings) is described for the inter-class generation or intra-class augmentation modules.

    Authors: We agree that the original manuscript relies exclusively on downstream TBPR metrics to support claims of sufficient realism and diversity. Independent metrics such as FID, attribute-level analysis, or human ratings are absent. In the revised version we will add FID comparisons between the synthetic images (both inter-class and intra-class) and real distributions, together with attribute-level fidelity checks where automatic attribute labels can be obtained from the generation process.

    revision: yes

  2. Referee: [§4 (experiments)] Because the generative models are pre-trained on real distributions, end-task performance alone cannot isolate whether success stems from the proposed automatic prompt and editing strategy or from the base generators. An ablation or control (e.g., random prompts or non-person-specific editing) is needed to support the 'practical utility' conclusion.

    Authors: The referee correctly identifies that the current experiments do not isolate the contribution of the automatic prompt construction and text-driven editing modules from the capabilities of the underlying pre-trained generators. We will add the requested control ablations in the revision: (i) inter-class generation with random prompts instead of our structured identity-centric prompts, and (ii) intra-class augmentation with non-person-specific editing instructions, reporting the resulting retrieval performance to demonstrate the incremental benefit of the proposed strategies.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study relies on experimental comparisons, not derivations or self-referential definitions

full rationale

The paper is a purely empirical study of synthetic data for TBPR. It proposes a data synthesis pipeline and evaluates it via experiments on real test sets, with no equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. All claims rest on direct performance comparisons between synthetic and real data setups. No step reduces by construction to its own inputs, satisfying the criteria for score 0 with an empty steps list.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical validation study and introduces no free parameters, mathematical axioms, or new postulated entities; it relies on existing generative models whose internal details are treated as black boxes.

pith-pipeline@v0.9.0 · 5737 in / 1054 out tokens · 112200 ms · 2026-05-22T22:52:48.628467+00:00 · methodology

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 8 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 5, 4

  2. [2]

    Rasa: Relation and sensitivity aware representation learning for text-based person search

    Yang Bai, Min Cao, Daming Gao, Ziqiang Cao, Chen Chen, Zhenfeng Fan, Liqiang Nie, and Min Zhang. Rasa: Relation and sensitivity aware representation learning for text-based person search. arXiv preprint arXiv:2305.13653, 2023. 2, 5, 6, 7

  3. [3]

    Looking be- yond appearances: Synthetic training data for deep cnns in re-identification

    Igor Barros Barbosa, Marco Cristani, Barbara Caputo, Alek- sander Rognhaugen, and Theoharis Theoharis. Looking be- yond appearances: Synthetic training data for deep cnns in re-identification. Computer Vision and Image Understand- ing, 167:50–62, 2018. 2

  4. [4]

    In- structpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18392–18402, 2023. 7

  5. [5]

    Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi- aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 22560–22570, 2023. 5, 7, 1

  6. [6]

    An empirical study of clip for text-based person search

    Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, and Min Zhang. An empirical study of clip for text-based person search. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 465–473, 2024. 1, 2, 5, 6

  7. [7]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023. 7

  8. [8]

    Improving text-based person search by spatial matching and adaptive threshold

    Tianlang Chen, Chenliang Xu, and Jiebo Luo. Improving text-based person search by spatial matching and adaptive threshold. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1879–1887. IEEE, 2018. 2

  9. [9]

    Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 5, 8, 2, 4

  10. [10]

    Noise map guidance: Inversion with spatial context for real image editing

    Hansam Cho, Jonghyun Lee, Seoung Bum Kim, Tae-Hyun Oh, and Yonghyun Jeong. Noise map guidance: Inversion with spatial context for real image editing. arXiv preprint arXiv:2402.04625, 2024. 4, 1

  11. [11]

    Semantically self-aligned network for text-to- image part-aware person re-identification

    Zefeng Ding, Changxing Ding, Zhiyin Shao, and Dacheng Tao. Semantically self-aligned network for text-to- image part-aware person re-identification. arXiv preprint arXiv:2107.12666, 2021. 2, 6

  12. [12]

    Using language to extend to unseen do- mains

    Lisa Dunlap, Clara Mohri, Devin Guillory, Han Zhang, Trevor Darrell, Joseph E Gonzalez, Aditi Raghunathan, and Anna Rohrbach. Using language to extend to unseen do- mains. International Conference on Learning Representa- tions (ICLR), 2023. 1, 2

  13. [13]

    Scaling laws of synthetic images for model training

    Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian. Scaling laws of synthetic images for model training... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7382–7392, 2024. 2, 7

  14. [14]

    Axm-net: Implicit cross-modal fea- ture alignment for person re-identification

    Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. Axm-net: Implicit cross-modal fea- ture alignment for person re-identification. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4477– 4485, 2022. 2

  15. [15]

    Unsuper- vised pre-training for person re-identification

    Dengpan Fu, Dongdong Chen, Jianmin Bao, Hao Yang, Lu Yuan, Lei Zhang, Houqiang Li, and Dong Chen. Unsuper- vised pre-training for person re-identification. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14750–14759, 2021. 1, 3

  16. [16]

    Bilma: Bidirectional local-matching for text-based person re-identification

    Takuro Fujii and Shuhei Tarashima. Bilma: Bidirectional local-matching for text-based person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2786–2790, 2023. 7

  17. [17]

    Semi-supervised text-based person search

    Daming Gao, Yang Bai, Min Cao, Hao Dou, Mang Ye, and Min Zhang. Semi-supervised text-based person search. arXiv preprint arXiv:2404.18106, 2024. 6

  18. [18]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014. 2

  19. [19]

    Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022

    Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022. 1, 2

  20. [20]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 4, 5, 7, 1

  21. [21]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in Neural Information Processing Systems , 30, 2017. 7

  22. [22]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 4

  23. [23]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024. 5, 4

  24. [24]

    Tag2text: Guiding vision-language model via image tagging

    Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023. 1

  25. [25]

    Cross-modal implicit relation rea- soning and aligning for text-to-image person retrieval

    Ding Jiang and Mang Ye. Cross-modal implicit relation rea- soning and aligning for text-to-image person retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2787–2797, 2023. 1, 2, 5, 6, 7

  26. [26]

    Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification

    In `es Hyeonsu Kim, JoungBin Lee, Soowon Son, Woo- jeong Jin, Kyusun Cho, Junyoung Seo, Min-Seop Kwak, Seokju Cho, JeongYeol Baek, Byeongwon Lee, et al. Pose- diversified augmentation with diffusion model for person re- identification. arXiv preprint arXiv:2406.16042, 2024. 2

  27. [27]

    Person search with natural lan- guage description

    Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. Person search with natural lan- guage description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 1970– 1979, 2017. 1, 2, 6

  28. [28]

    Learning semantic- aligned feature representation for text-based person search

    Shiping Li, Min Cao, and Min Zhang. Learning semantic- aligned feature representation for text-based person search. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 2724–2728. IEEE, 2022. 2, 7

  29. [29]

    Adaptive uncertainty-based learning for text-based person retrieval

    Shenshen Li, Chen He, Xing Xu, Fumin Shen, Yang Yang, and Heng Tao Shen. Adaptive uncertainty-based learning for text-based person retrieval. In Proceedings of the AAAI Con- ference on Artificial Intelligence, pages 3172–3180, 2024. 1, 6, 7, 5

  30. [30]

    Cross-modal adaptive dual association for text-to-image per- son retrieval

    Dixuan Lin, Yixing Peng, Jingke Meng, and Wei-Shi Zheng. Cross-modal adaptive dual association for text-to-image per- son retrieval. IEEE Transactions on Multimedia, 2024. 7

  31. [31]

    Causality-inspired invariant representation learning for text-based person retrieval

    Yu Liu, Guihe Qin, Haipeng Chen, Zhiyong Cheng, and Xun Yang. Causality-inspired invariant representation learning for text-based person retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence , pages 14052–14060,

  32. [32]

    Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023

    Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023. 4

  33. [33]

    Synthesizing efficient data with diffu- sion models for person re-identification pre-training

    Ke Niu, Haiyang Yu, Xuelin Qian, Teng Fu, Bin Li, and Xiangyang Xue. Synthesizing efficient data with diffu- sion models for person re-identification pre-training. arXiv preprint arXiv:2406.06045, 2024. 2

  34. [34]

    Noisy-correspondence learning for text-to-image person re-identification

    Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, and Peng Hu. Noisy-correspondence learning for text-to-image person re-identification. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27197–27206, 2024. 6

  35. [35]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International Conference on Machine Learning , pages 8748–8763. PMLR, 2021. 2

  36. [36]

    Real-time flying object detection with yolov8.arXiv preprint arXiv:2305.09972, 2023

    Dillon Reis, Jordan Kupec, Jacqueline Hong, and Ahmad Daoudi. Real-time flying object detection with yolov8.arXiv preprint arXiv:2305.09972, 2023. 5

  37. [37]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 4, 7

  38. [38]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500– 22510, 2023. 4

  39. [39]

    Learning granularity-unified representations for text-to-image person re-identification

    Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. In Proceedings of the 30th Acm International Conference on Multimedia, pages 5566–5574, 2022. 2

  40. [40]

    Unified pre-training with pseudo texts for text-to-image person re-identification

    Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, and Jingdong Wang. Unified pre-training with pseudo texts for text-to-image person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 11174–11184, 2023. 1, 2, 3, 6, 7

  41. [41]

    See finer, see more: Implicit modality alignment for text-based person retrieval

    Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. See finer, see more: Implicit modality alignment for text-based person retrieval. In European Conference on Computer Vision , pages 624–

  42. [42]

    Springer, 2022. 2, 7

  43. [43]

    Diverse person: Customize your own dataset for text-based person search

    Zifan Song, Guosheng Hu, and Cairong Zhao. Diverse person: Customize your own dataset for text-based person search. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4943–4951, 2024. 1, 2, 3, 4

  44. [44]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 5

  45. [45]

    Dissecting person re- identification from the viewpoint of viewpoint

    Xiaoxiao Sun and Liang Zheng. Dissecting person re- identification from the viewpoint of viewpoint. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 608–617, 2019. 2, 6, 9

  46. [46]

    Harnessing the power of mllms for transferable text-to-image person reid

    Wentan Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yib- ing Zhan, and Dapeng Tao. Harnessing the power of mllms for transferable text-to-image person reid. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17127–17137, 2024. 1, 2, 3, 5, 6, 7

  47. [47]

    Learning attention-guided pyramidal features for few-shot fine-grained recognition

    Hao Tang, Chengcheng Yuan, Zechao Li, and Jinhui Tang. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognition, 130:108792,

  48. [48]

    Stablerep: Synthetic images from text-to- image models make strong visual representation learners

    Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to- image models make strong visual representation learners. Advances in Neural Information Processing Systems , 36,

  49. [49]

    Surpassing real-world source training data: Random 3d characters for generalizable person re-identification

    Yanan Wang, Shengcai Liao, and Ling Shao. Surpassing real-world source training data: Random 3d characters for generalizable person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia , pages 3422–3430, 2020. 2, 6

  50. [50]

    Cloning outfits from real-world images to 3d characters for gen- eralizable person re-identification

    Yanan Wang, Xuezhi Liang, and Shengcai Liao. Cloning outfits from real-world images to 3d characters for gen- eralizable person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4900–4909, 2022. 2

  51. [51]

    Vi- taa: Visual-textual attributes alignment in person search by natural language

    Zhe Wang, Zhiyuan Fang, Jun Wang, and Yezhou Yang. Vi- taa: Visual-textual attributes alignment in person search by natural language. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 402–420. Springer, 2020. 2, 7

  52. [52]

    Look before you leap: Improv- ing text-based person retrieval by learning a consistent cross- modal common manifold

    Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Look before you leap: Improv- ing text-based person retrieval by learning a consistent cross- modal common manifold. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1984–1992,

  53. [53]

    Person transfer gan to bridge domain gap for person re- identification

    Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re- identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 79–88,

  54. [54]

    Contrastive transformer learning with proximity data generation for text-based per- son search

    Hefeng Wu, Weifeng Chen, Zhibin Liu, Tianshui Chen, Zhiguang Chen, and Liang Lin. Contrastive transformer learning with proximity data generation for text-based per- son search. IEEE Transactions on Circuits and Systems for Video Technology, 2023. 1, 2, 3, 4

  55. [55]

    Lapscore: language-guided person search via color reasoning

    Yushuang Wu, Zizheng Yan, Xiaoguang Han, Guanbin Li, Changqing Zou, and Shuguang Cui. Lapscore: language-guided person search via color reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1624–1633, 2021. 7

  56. [56]

    Laip: Learning local alignment from image-phrase modeling for text-based person search

    Yu Wu, Haiguang Wang, Mengxia Wu, Min Cao, and Min Zhang. Laip: Learning local alignment from image-phrase modeling for text-based person search. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–10. IEEE, 2024. 7

  57. [57]

    Refined knowledge transfer for language-based person search

    Ziqiang Wu, Bingpeng Ma, Hong Chang, and Shiguang Shan. Refined knowledge transfer for language-based person search. IEEE Transactions on Multimedia, 25:9315–9329,

  58. [58]

    Less is more: Learning from synthetic data with fine-grained attributes for person re-identification

    Suncheng Xiang, Dahong Qian, Mengyuan Guan, Binjie Yan, Ting Liu, Yuzhuo Fu, and Guanjie You. Less is more: Learning from synthetic data with fine-grained attributes for person re-identification. ACM Transactions on Multimedia Computing, Communications and Applications, 19(5s):1–20, 2023. 2

  59. [59]

    Image-specific information suppression and implicit local alignment for text-based person search

    Shuanglin Yan, Hao Tang, Liyan Zhang, and Jinhui Tang. Image-specific information suppression and implicit local alignment for text-based person search. IEEE Transactions on Neural Networks and Learning Systems, 2023. 2

  60. [60]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 1, 2, 4

  61. [61]

    Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark

    Shuyu Yang, Yinan Zhou, Zhedong Zheng, Yaxiong Wang, Li Zhu, and Yujiao Wu. Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4492–4501, 2023. 1, 2, 3, 6, 7, 5

  62. [62]

    Unrealperson: An adaptive pipeline towards costless person re-identification

    Tianyu Zhang, Lingxi Xie, Longhui Wei, Zijie Zhuang, Yongfei Zhang, Bo Li, and Qi Tian. Unrealperson: An adaptive pipeline towards costless person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11506–11515, 2021. 2

  63. [63]

    Deep cross-modal projection learning for image-text matching

    Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 686–701, 2018. 2

  64. [64]

    Joint discriminative and generative learning for person re-identification

    Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2138–2147, 2019. 6, 9

  65. [65]

    Camstyle: A novel data augmentation method for person re-identification

    Zhun Zhong, Liang Zheng, Zhedong Zheng, Shaozi Li, and Yi Yang. Camstyle: A novel data augmentation method for person re-identification. IEEE Transactions on Image Processing, 28(3):1176–1190, 2018. 2

  66. [66]

    Dssl: Deep surroundings-person separation learning for text-based person retrieval

    Aichun Zhu, Zijie Wang, Yifeng Li, Xili Wan, Jing Jin, Tian Wang, Fangqiang Hu, and Gang Hua. Dssl: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pages 209–217, 2021. 2, 6, 3