An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval

Dong Yi; Jinqiao Wang; Mang Ye; Min Cao; Yuxin Lu; Ziyin Zeng

arxiv: 2503.22171 · v2 · submitted 2025-03-28 · 💻 cs.CV

Pith reviewed 2026-05-22 22:52 UTC · model grok-4.3

keywords synthetic data · text-based person retrieval · data synthesis pipeline · image generation · text generation · privacy preservation · data augmentation · empirical study

The pith

A fully synthetic data pipeline can serve as a standalone replacement or augmentation to real data for training text-based person retrieval models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a data synthesis method that creates person images and matching text descriptions without using any real person images or manual labels. It generates diverse identities through automatic prompt-based image creation and adds within-identity variations via text-driven editing, then produces corresponding text descriptions automatically. Extensive experiments test this synthetic data across multiple scenarios to measure how well it trains retrieval models that are evaluated on real test sets. The findings establish that the synthetic data can train competitive models on its own or improve performance when mixed with real data. This approach addresses the privacy and labeling costs that limit real-data collection for such systems.

Core claim

The paper claims that a unified synthesis pipeline operating entirely without real person data produces training examples whose practical utility is demonstrated through experiments: an inter-class module creates diverse identity-centric images via automatic prompt construction, an intra-class module increases identity variation via text-driven image editing, and automatic text generation supplies the paired descriptions, allowing the resulting data to function either as a complete replacement for real data or as a complementary augmentation.

What carries the argument

The unified data synthesis pipeline that combines inter-class image generation via automatic prompt construction with intra-class augmentation via text-driven image editing, plus automatic textual description generation.
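As a rough illustration of the prompt-construction idea, the sketch below samples one descriptor per attribute slot to form an identity-level prompt, then attaches a few edit instructions for intra-class variation. The descriptor lists, template wording, edit instructions, and function names are all hypothetical placeholders. They are not the paper's descriptor lists (its Figure 5) or its actual editing operations, and the image generator and editor themselves are left out.

```python
import random

# Hypothetical descriptor lists; the paper's real lists are not reproduced here.
DESCRIPTORS = {
    "age": ["young", "middle-aged", "elderly"],
    "gender": ["man", "woman"],
    "top": ["red jacket", "white t-shirt", "gray hoodie"],
    "bottom": ["blue jeans", "black trousers", "khaki shorts"],
    "item": ["backpack", "handbag", "umbrella"],
}

# Hypothetical template for a rough identity description.
TEMPLATE = "full-body photo, {age} {gender}, wearing a {top} and {bottom}, carrying a {item}"

# Hypothetical edit instructions used to vary a single identity.
EDITS = ["make the person walk", "change the background to a street", "make it evening"]


def build_identity_prompt(rng):
    """Sample one descriptor per slot and return the attributes plus the filled prompt."""
    attrs = {slot: rng.choice(options) for slot, options in DESCRIPTORS.items()}
    return {"attrs": attrs, "prompt": TEMPLATE.format(**attrs)}


def build_dataset_prompts(num_identities, edits_per_identity, seed=0):
    """Return one record per identity: a base prompt and a list of edit instructions."""
    rng = random.Random(seed)
    records = []
    for identity_id in range(num_identities):
        record = build_identity_prompt(rng)
        record["id"] = identity_id
        # Distinct edits per identity, drawn without replacement.
        record["edits"] = rng.sample(EDITS, k=edits_per_identity)
        records.append(record)
    return records
```

Keeping the sampled attributes next to the prompt is a deliberate choice in this sketch: the same attributes could later seed a templated text description, so image and text stay consistent without manual labels.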

If this is right

Models trained solely on the synthetic data achieve competitive retrieval accuracy on real test sets.
Mixing the synthetic data with real data produces higher performance than real data alone in the tested scenarios.
The method removes the requirement to collect real person images or obtain manual textual annotations.
The pipeline supports systematic testing of synthetic data effectiveness across a range of real-world retrieval conditions.
Automatic generation enables production of arbitrarily large training sets without additional human effort.
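To make the replacement-versus-augmentation distinction concrete, here is a minimal sketch of assembling a training set under three data-availability scenarios. The labels S1 (no data), S2 (limited data), and S3 (abundant data) appear in a paper excerpt quoted on this page. How the paper actually combines real and synthetic pairs in each scenario is not stated here, so the plain concatenation, the `real_fraction` parameter, and the function name are assumptions for illustration only.

```python
import random


def build_training_set(real_pairs, synthetic_pairs, scenario, real_fraction=0.1, seed=0):
    """Assemble image-text training pairs for one data-availability scenario.

    S1: no real data, so only synthetic pairs are used.
    S2: limited real data, a random subset of real pairs plus synthetic pairs.
    S3: abundant real data, all real pairs plus synthetic pairs.
    The real_fraction default is an arbitrary illustrative value.
    """
    rng = random.Random(seed)
    if scenario == "S1":
        real_part = []
    elif scenario == "S2":
        keep = max(1, int(len(real_pairs) * real_fraction)) if real_pairs else 0
        real_part = rng.sample(list(real_pairs), keep)
    elif scenario == "S3":
        real_part = list(real_pairs)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    combined = real_part + list(synthetic_pairs)
    rng.shuffle(combined)  # interleave real and synthetic pairs
    return combined
```

For example, with 20 real pairs, 5 synthetic pairs, and `real_fraction=0.25`, the S2 set holds 5 real and 5 synthetic pairs.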

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same generation approach could be adapted to reduce data collection needs in other vision-language retrieval tasks.
If the synthetic data preserves or increases diversity, it might help address dataset biases that appear in real collections.
Scaling the pipeline with different image generators could further narrow the quality gap to real data.
Widespread adoption would lower the infrastructure cost of deploying text-based person retrieval systems.

Load-bearing premise

The generated synthetic images and descriptions are realistic and diverse enough that models trained on them achieve performance on real test data that reflects actual usefulness.

What would settle it

If models trained only on the synthetic data achieve substantially lower accuracy than real-data baselines on standard real-world benchmarks such as CUHK-PEDES, the claim of practical utility as a replacement would be falsified.
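A comparison of this kind needs a retrieval metric. Rank-k accuracy is a common choice for text-to-image retrieval, and the sketch below computes it from a precomputed text-to-image similarity matrix. Whether this is the exact metric the paper reports is an assumption, and the function and argument names are illustrative.

```python
def rank_k_accuracy(similarity, query_ids, gallery_ids, k=1):
    """Fraction of text queries whose top-k retrieved images include the correct identity.

    similarity[i][j] is the score between text query i and gallery image j.
    query_ids[i] and gallery_ids[j] are the identity labels of each query and image.
    """
    hits = 0
    for scores, query_id in zip(similarity, query_ids):
        # Gallery indices sorted by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        top_ids = [gallery_ids[j] for j in ranked[:k]]
        hits += query_id in top_ids
    return hits / len(query_ids)
```

Under this sketch, a synthetic-only model and a real-data baseline would each produce a similarity matrix over the same real test set, and the two Rank-k values would be compared directly.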

Figures

Figures reproduced from arXiv: 2503.22171 by Dong Yi, Jinqiao Wang, Mang Ye, Min Cao, Yuxin Lu, Ziyin Zeng.

**Figure 2.** Workflow of our framework for validating synthetic data for TBPR. It involves the following steps. (1) Inter-class image genera…

**Figure 3.** Performance trend under different value of the guidance…

**Figure 4.** Illustration of real data (a) and synthetic data (b)

**Figure 5.** Descriptor lists used in the rough description templates for inter-class image generation.

**Figure 6.** Some templates used in generating the textual description of synthetic image.

**Figure 7.** Performance trend under different number of synthetic…

**Figure 8.** Performance trend under different number of synthetic data and different number of real data on CUHK-PEDES.

**Figure 9.** Illustration of noisy images. Noise typically arises from…

**Figure 10.** Illustration of noisy texts. Noise usually manifests as inappropriate symbols or irrelevant information within the text.

**Figure 11.** Illustration of synthetic data. The images are generated from the proposed inter-class image generation pipeline under scenario…

**Figure 12.** Illustration of editing images

**Figure 13.** Illustration of real data (CUHK-PEDES) and synthetic data. Synthetic data is shown from other method […

Original abstract

Data plays a pivotal role in Text-Based Person Retrieval (TBPR) research. Mainstream research paradigm necessitates real-world person images with manual textual annotations for training models, posing privacy concerns and annotation burdens. Several pioneering efforts explore synthetic data generation, and yet still depend on real data as a foundation, inheriting the same limitations. The feasibility of purely synthetic TBPR data remains unexplored, and there is currently no systematic study on the effectiveness boundaries of synthetic data across various real-world scenarios. In this work, we present the first comprehensive empirical study of synthetic data for TBPR, with two key aspects. (1) We propose a unified data synthesis pipeline that can operate entirely without real person data. It combines an inter-class image generation module that produces diverse identity-centric images by means of an automatic prompt construction strategy, and an intra-class augmentation module that enhances identity variation through text-driven image editing. (2) Leveraging this pipeline and an automatic textual description generation, we explore the effectiveness of synthetic data in diverse scenarios through extensive experiments, to reveal its practical utility as either a standalone replacement or a complementary augmentation to real data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is the first empirical mapping of purely synthetic data for TBPR via a no-real-data pipeline, but the utility claims rest only on downstream retrieval scores without separate checks on image realism or diversity.

The paper's main move is to build and test a pipeline that generates both images and text descriptions for text-based person retrieval without touching any real person photos. It splits the work into inter-class generation through automatic prompts and intra-class variation via text-driven edits, then runs experiments to see when the output can replace real data or just supplement it across different scenarios. That setup is new relative to earlier synthetic efforts that still started from real images, and the systematic scenario testing gives a clearer picture of practical boundaries than prior work. The privacy angle is handled directly by design, which is a clean response to the annotation and consent problems in the field.

The experiments are described as extensive, so the results on replacement versus augmentation should be worth looking at for anyone already running TBPR models. The soft spot is exactly the one flagged in the stress test: there is no independent evidence that the synthetic images are realistic or diverse enough on their own. No FID scores, attribute fidelity checks, or human ratings are mentioned, only the end-task retrieval numbers on real test sets. That leaves the central assumption untested and opens the door to the generative models simply replaying patterns from their own real-data pretraining.

The paper is aimed at retrieval researchers who need data without privacy overhead. A reader working on synthetic augmentation in vision would pick up concrete pipeline details and scenario results. It is solid enough on the empirical framing and the privacy motivation to go to a serious referee, though the authors will need to add direct data-quality validation to make the replacement claims stick.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first comprehensive empirical study of synthetic data for Text-Based Person Retrieval (TBPR). It proposes a unified synthesis pipeline that generates both images and descriptions entirely without real person data, via an inter-class module using automatic prompt construction for diverse identities and an intra-class module using text-driven editing for variation. Automatic textual description generation is also included. Extensive experiments across scenarios are used to assess whether synthetic data can serve as a standalone replacement or complementary augmentation to real data.

Significance. If the results hold, the work could meaningfully reduce privacy and annotation burdens in TBPR by establishing empirical boundaries for synthetic-data utility. Credit is given for conducting the first systematic study of a fully synthetic pipeline and for running experiments across multiple scenarios rather than a single setting.

major comments (2)

[§3 (unified data synthesis pipeline)] The central claim that the pipeline produces data 'sufficiently realistic and diverse' for practical utility rests solely on downstream retrieval metrics on real test sets. No independent validation of the synthetic distribution (FID, attribute-level fidelity, or human realism ratings) is described for the inter-class generation or intra-class augmentation modules.
[§4 (experiments)] Because the generative models are pre-trained on real distributions, end-task performance alone cannot isolate whether success stems from the proposed automatic prompt and editing strategy or from the base generators. An ablation or control (e.g., random prompts or non-person-specific editing) is needed to support the 'practical utility' conclusion.
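For reference, the FID named in the first major comment is the Fréchet distance between Gaussian fits to feature embeddings of two image sets. In its standard form, with $(\mu_r, \Sigma_r)$ and $(\mu_s, \Sigma_s)$ denoting the feature mean and covariance of the real and synthetic images respectively (the subscripts are notation chosen here, not the paper's), it reads:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2\,(\Sigma_r \Sigma_s)^{1/2} \right)
```

Lower values indicate closer feature statistics. The score summarizes distribution-level similarity only, so it would complement the attribute-level and human checks the referee also lists rather than replace them.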

minor comments (2)

[Abstract] The abstract states that experiments explore 'diverse scenarios' but does not enumerate them; a brief list would improve clarity.
[§3.1] Notation for the automatic prompt construction strategy and text-driven editing operations should be introduced with explicit symbols or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, agreeing where the manuscript requires strengthening and outlining the planned changes.

Referee: [§3 (unified data synthesis pipeline)] The central claim that the pipeline produces data 'sufficiently realistic and diverse' for practical utility rests solely on downstream retrieval metrics on real test sets. No independent validation of the synthetic distribution (FID, attribute-level fidelity, or human realism ratings) is described for the inter-class generation or intra-class augmentation modules.

Authors: We agree that the original manuscript relies exclusively on downstream TBPR metrics to support claims of sufficient realism and diversity. Independent metrics such as FID, attribute-level analysis, or human ratings are absent. In the revised version we will add FID comparisons between the synthetic images (both inter-class and intra-class) and real distributions, together with attribute-level fidelity checks where automatic attribute labels can be obtained from the generation process.

revision: yes

Referee: [§4 (experiments)] Because the generative models are pre-trained on real distributions, end-task performance alone cannot isolate whether success stems from the proposed automatic prompt and editing strategy or from the base generators. An ablation or control (e.g., random prompts or non-person-specific editing) is needed to support the 'practical utility' conclusion.

Authors: The referee correctly identifies that the current experiments do not isolate the contribution of the automatic prompt construction and text-driven editing modules from the capabilities of the underlying pre-trained generators. We will add the requested control ablations in the revision: (i) inter-class generation with random prompts instead of our structured identity-centric prompts, and (ii) intra-class augmentation with non-person-specific editing instructions, reporting the resulting retrieval performance to demonstrate the incremental benefit of the proposed strategies.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study relies on experimental comparisons, not derivations or self-referential definitions

full rationale

The paper is a purely empirical study of synthetic data for TBPR. It proposes a data synthesis pipeline and evaluates it via experiments on real test sets, with no equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. All claims rest on direct performance comparisons between synthetic and real data setups. No step reduces by construction to its own inputs, satisfying the criteria for score 0 with an empty steps list.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical validation study and introduces no free parameters, mathematical axioms, or new postulated entities; it relies on existing generative models whose internal details are treated as black boxes.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: We propose an inter-class image generation pipeline... automatic prompt construction strategy... intra-class image augmentation pipeline... three types of edits... automatic text generation

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Study on the effectiveness of synthetic data... three scenarios (S1: No data, S2: Limited data, S3: Abundant data)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 8 internal anchors

[1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 5, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Rasa: Relation and sensitivity aware representation learning for text-based person search

Yang Bai, Min Cao, Daming Gao, Ziqiang Cao, Chen Chen, Zhenfeng Fan, Liqiang Nie, and Min Zhang. Rasa: Relation and sensitivity aware representation learning for text-based person search. arXiv preprint arXiv:2305.13653, 2023. 2, 5, 6, 7

work page arXiv 2023
[3]

Looking be- yond appearances: Synthetic training data for deep cnns in re-identification

Igor Barros Barbosa, Marco Cristani, Barbara Caputo, Alek- sander Rognhaugen, and Theoharis Theoharis. Looking be- yond appearances: Synthetic training data for deep cnns in re-identification. Computer Vision and Image Understand- ing, 167:50–62, 2018. 2

work page 2018
[4]

In- structpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18392–18402, 2023. 7

work page 2023
[5]

Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing

Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi- aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 22560–22570, 2023. 5, 7, 1

work page 2023
[6]

An empirical study of clip for text-based person search

Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, and Min Zhang. An empirical study of clip for text-based person search. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 465–473, 2024. 1, 2, 5, 6

work page 2024
[7]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Improving text-based person search by spatial matching and adaptive threshold

Tianlang Chen, Chenliang Xu, and Jiebo Luo. Improving text-based person search by spatial matching and adaptive threshold. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1879–1887. IEEE, 2018. 2

work page 2018
[9]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 5, 8, 2, 4

work page 2024
[10]

Noise map guidance: Inversion with spatial context for real image editing

Hansam Cho, Jonghyun Lee, Seoung Bum Kim, Tae-Hyun Oh, and Yonghyun Jeong. Noise map guidance: Inversion with spatial context for real image editing. arXiv preprint arXiv:2402.04625, 2024. 4, 1

work page arXiv 2024
[11]

Semantically self-aligned network for text-to- image part-aware person re-identification

Zefeng Ding, Changxing Ding, Zhiyin Shao, and Dacheng Tao. Semantically self-aligned network for text-to- image part-aware person re-identification. arXiv preprint arXiv:2107.12666, 2021. 2, 6

work page arXiv 2021
[12]

Using language to extend to unseen do- mains

Lisa Dunlap, Clara Mohri, Devin Guillory, Han Zhang, Trevor Darrell, Joseph E Gonzalez, Aditi Raghunathan, and Anna Rohrbach. Using language to extend to unseen do- mains. International Conference on Learning Representa- tions (ICLR), 2023. 1, 2

work page 2023
[13]

Scaling laws of synthetic images for model training

Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian. Scaling laws of synthetic images for model training... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7382–7392, 2024. 2, 7

work page 2024
[14]

Axm-net: Implicit cross-modal fea- ture alignment for person re-identification

Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. Axm-net: Implicit cross-modal fea- ture alignment for person re-identification. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4477– 4485, 2022. 2

work page 2022
[15]

Unsuper- vised pre-training for person re-identification

Dengpan Fu, Dongdong Chen, Jianmin Bao, Hao Yang, Lu Yuan, Lei Zhang, Houqiang Li, and Dong Chen. Unsuper- vised pre-training for person re-identification. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14750–14759, 2021. 1, 3

work page 2021
[16]

Bilma: Bidirectional local-matching for text-based person re-identification

Takuro Fujii and Shuhei Tarashima. Bilma: Bidirectional local-matching for text-based person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2786–2790, 2023. 7

work page 2023
[17]

Semi-supervised text-based person search

Daming Gao, Yang Bai, Min Cao, Hao Dou, Mang Ye, and Min Zhang. Semi-supervised text-based person search. arXiv preprint arXiv:2404.18106, 2024. 6

work page arXiv 2024
[18]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014. 2

work page 2014
[19]

Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022

Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022. 1, 2

work page arXiv 2022
[20]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 4, 5, 7, 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in Neural Information Processing Systems , 30, 2017. 7

work page 2017
[22]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 4

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024. 5, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Tag2text: Guiding vision-language model via image tagging

Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023. 1

work page arXiv 2023
[25]

Cross-modal implicit relation rea- soning and aligning for text-to-image person retrieval

Ding Jiang and Mang Ye. Cross-modal implicit relation rea- soning and aligning for text-to-image person retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2787–2797, 2023. 1, 2, 5, 6, 7

work page 2023
[26]

Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification

In `es Hyeonsu Kim, JoungBin Lee, Soowon Son, Woo- jeong Jin, Kyusun Cho, Junyoung Seo, Min-Seop Kwak, Seokju Cho, JeongYeol Baek, Byeongwon Lee, et al. Pose- diversified augmentation with diffusion model for person re- identification. arXiv preprint arXiv:2406.16042, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Person search with natural lan- guage description

Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. Person search with natural lan- guage description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 1970– 1979, 2017. 1, 2, 6

work page 1970
[28]

Learning semantic- aligned feature representation for text-based person search

Shiping Li, Min Cao, and Min Zhang. Learning semantic- aligned feature representation for text-based person search. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 2724–2728. IEEE, 2022. 2, 7

work page 2022
[29]

Adaptive uncertainty-based learning for text-based person retrieval

Shenshen Li, Chen He, Xing Xu, Fumin Shen, Yang Yang, and Heng Tao Shen. Adaptive uncertainty-based learning for text-based person retrieval. In Proceedings of the AAAI Con- ference on Artificial Intelligence, pages 3172–3180, 2024. 1, 6, 7, 5

work page 2024
[30]

Cross-modal adaptive dual association for text-to-image per- son retrieval

Dixuan Lin, Yixing Peng, Jingke Meng, and Wei-Shi Zheng. Cross-modal adaptive dual association for text-to-image per- son retrieval. IEEE Transactions on Multimedia, 2024. 7

work page 2024
[31]

Causality-inspired invariant representation learning for text-based person retrieval

Yu Liu, Guihe Qin, Haipeng Chen, Zhiyong Cheng, and Xun Yang. Causality-inspired invariant representation learning for text-based person retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence , pages 14052–14060,

work page
[32]

Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023

Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023. 4

work page arXiv 2023
[33]

Synthesizing efficient data with diffu- sion models for person re-identification pre-training

Ke Niu, Haiyang Yu, Xuelin Qian, Teng Fu, Bin Li, and Xiangyang Xue. Synthesizing efficient data with diffu- sion models for person re-identification pre-training. arXiv preprint arXiv:2406.06045, 2024. 2

work page arXiv 2024
[34]

Noisy-correspondence learning for text-to-image person re-identification

Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, and Peng Hu. Noisy-correspondence learning for text-to-image person re-identification. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27197–27206, 2024. 6

work page 2024
[35]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International Conference on Machine Learning , pages 8748–8763. PMLR, 2021. 2

work page 2021
[36]

Real-time flying object detection with yolov8.arXiv preprint arXiv:2305.09972, 2023

Dillon Reis, Jordan Kupec, Jacqueline Hong, and Ahmad Daoudi. Real-time flying object detection with yolov8.arXiv preprint arXiv:2305.09972, 2023. 5

work page arXiv 2023
[37]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 4, 7

work page 2022
[38]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500– 22510, 2023. 4

work page 2023
[39]

Learning granularity-unified representations for text-to-image person re-identification

Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. In Proceedings of the 30th Acm International Conference on Multimedia, pages 5566–5574, 2022. 2

work page 2022
[40]

Unified pre-training with pseudo texts for text-to-image person re-identification

Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, and Jingdong Wang. Unified pre-training with pseudo texts for text-to-image person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 11174–11184, 2023. 1, 2, 3, 6, 7

work page 2023
[41]

See finer, see more: Implicit modality alignment for text-based person retrieval

Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. See finer, see more: Implicit modality alignment for text-based person retrieval. In European Conference on Computer Vision , pages 624–

work page
[42]

Springer, 2022. 2, 7

work page 2022
[43]

Diverse person: Customize your own dataset for text-based person search

Zifan Song, Guosheng Hu, and Cairong Zhao. Diverse person: Customize your own dataset for text-based person search. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4943–4951, 2024. 1, 2, 3, 4

work page 2024
[44]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Dissecting person re- identification from the viewpoint of viewpoint

Xiaoxiao Sun and Liang Zheng. Dissecting person re- identification from the viewpoint of viewpoint. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 608–617, 2019. 2, 6, 9

work page 2019
[46]

Harnessing the power of mllms for transferable text-to-image person reid

Wentan Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yib- ing Zhan, and Dapeng Tao. Harnessing the power of mllms for transferable text-to-image person reid. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17127–17137, 2024. 1, 2, 3, 5, 6, 7

[47]

Learning attention-guided pyramidal features for few-shot fine-grained recognition

Hao Tang, Chengcheng Yuan, Zechao Li, and Jinhui Tang. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognition, 130:108792,

[48]

Stablerep: Synthetic images from text-to-image models make strong visual representation learners

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to-image models make strong visual representation learners. Advances in Neural Information Processing Systems, 36,

[49]

Surpassing real-world source training data: Random 3d characters for generalizable person re-identification

Yanan Wang, Shengcai Liao, and Ling Shao. Surpassing real-world source training data: Random 3d characters for generalizable person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia, pages 3422–3430, 2020.

[50]

Cloning outfits from real-world images to 3d characters for generalizable person re-identification

Yanan Wang, Xuezhi Liang, and Shengcai Liao. Cloning outfits from real-world images to 3d characters for generalizable person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4900–4909, 2022.

[51]

Vitaa: Visual-textual attributes alignment in person search by natural language

Zhe Wang, Zhiyuan Fang, Jun Wang, and Yezhou Yang. Vitaa: Visual-textual attributes alignment in person search by natural language. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 402–420. Springer, 2020.

[52]

Look before you leap: Improving text-based person retrieval by learning a consistent cross-modal common manifold

Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Look before you leap: Improving text-based person retrieval by learning a consistent cross-modal common manifold. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1984–1992,

[53]

Person transfer gan to bridge domain gap for person re-identification

Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 79–88,

[54]

Contrastive transformer learning with proximity data generation for text-based person search

Hefeng Wu, Weifeng Chen, Zhibin Liu, Tianshui Chen, Zhiguang Chen, and Liang Lin. Contrastive transformer learning with proximity data generation for text-based person search. IEEE Transactions on Circuits and Systems for Video Technology, 2023.

[55]

Lapscore: language-guided person search via color reasoning

Yushuang Wu, Zizheng Yan, Xiaoguang Han, Guanbin Li, Changqing Zou, and Shuguang Cui. Lapscore: language-guided person search via color reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1624–1633, 2021.

[56]

Laip: Learning local alignment from image-phrase modeling for text-based person search

Yu Wu, Haiguang Wang, Mengxia Wu, Min Cao, and Min Zhang. Laip: Learning local alignment from image-phrase modeling for text-based person search. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–10. IEEE, 2024.

[57]

Refined knowledge transfer for language-based person search

Ziqiang Wu, Bingpeng Ma, Hong Chang, and Shiguang Shan. Refined knowledge transfer for language-based person search. IEEE Transactions on Multimedia, 25:9315–9329,

[58]

Less is more: Learning from synthetic data with fine-grained attributes for person re-identification

Suncheng Xiang, Dahong Qian, Mengyuan Guan, Binjie Yan, Ting Liu, Yuzhuo Fu, and Guanjie You. Less is more: Learning from synthetic data with fine-grained attributes for person re-identification. ACM Transactions on Multimedia Computing, Communications and Applications, 19(5s):1–20, 2023.

[59]

Image-specific information suppression and implicit local alignment for text-based person search

Shuanglin Yan, Hao Tang, Liyan Zhang, and Jinhui Tang. Image-specific information suppression and implicit local alignment for text-based person search. IEEE Transactions on Neural Networks and Learning Systems, 2023.

[60]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

[61]

Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark

Shuyu Yang, Yinan Zhou, Zhedong Zheng, Yaxiong Wang, Li Zhu, and Yujiao Wu. Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4492–4501, 2023.

[62]

Unrealperson: An adaptive pipeline towards costless person re-identification

Tianyu Zhang, Lingxi Xie, Longhui Wei, Zijie Zhuang, Yongfei Zhang, Bo Li, and Qi Tian. Unrealperson: An adaptive pipeline towards costless person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11506–11515, 2021.

[63]

Deep cross-modal projection learning for image-text matching

Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 686–701, 2018.

[64]

Joint discriminative and generative learning for person re-identification

Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2138–2147, 2019.

[65]

Camstyle: A novel data augmentation method for person re-identification

Zhun Zhong, Liang Zheng, Zhedong Zheng, Shaozi Li, and Yi Yang. Camstyle: A novel data augmentation method for person re-identification. IEEE Transactions on Image Processing, 28(3):1176–1190, 2018.

[66]

Dssl: Deep surroundings-person separation learning for text-based person retrieval

Aichun Zhu, Zijie Wang, Yifeng Li, Xili Wan, Jing Jin, Tian Wang, Fangqiang Hu, and Gang Hua. Dssl: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pages 209–217, 2021.

