When Large Vision-Language Models Meet Person Re-Identification

Bin Li; Qizao Wang; Xiangyang Xue

arxiv: 2411.18111 · v2 · submitted 2024-11-27 · 💻 cs.CV

Pith reviewed 2026-05-23 16:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords person re-identification · large vision-language models · semantic token · Semantic-Guided Interaction · cross-modal · end-to-end training · identity representation

The pith

LVLM-ReID adapts large vision-language models for person re-identification by generating a single refined semantic token as the identity representation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large vision-language models can be harnessed for person re-identification by instructing them to generate one semantic token that captures key appearance semantics from an image. This token is refined through a Semantic-Guided Interaction module that creates reciprocal interactions with visual tokens, allowing the model to integrate semantic understanding into end-to-end ReID training. The approach enables the LVLM to capture rich semantic cues during both training and inference without requiring additional image-text annotations. A sympathetic reader would care because it shows how generative models can support discriminative tasks like matching pedestrians across cameras, achieving competitive results on benchmarks.

Core claim

The framework employs instructions to guide the LVLM in generating one semantic token that encapsulates key appearance semantics, which is refined through the Semantic-Guided Interaction module to establish reciprocal interaction between the semantic token and visual tokens, ultimately using the reinforced semantic token as the representation of pedestrian identity.

What carries the argument

The single semantic token generated by the LVLM and refined by the Semantic-Guided Interaction (SGI) module, which enables interaction with visual tokens to produce the identity representation.
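The abstract does not spell out how SGI is built, so the following is only a minimal Python sketch of one way a reciprocal interaction between a semantic token and visual tokens could look. It is not the authors' module. It assumes identity projections (no learned weights), a single attention head, and a sigmoid gate for the semantic-to-visual direction; the names `sgi_sketch`, `attend`, and the gating rule are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def sgi_sketch(semantic_token, visual_tokens):
    """Hypothetical two-step reciprocal interaction (an assumption, not the paper's SGI).

    Step 1 (semantic -> visual): each visual token receives a gated copy of
    the semantic token, with the gate set by their scaled similarity.
    Step 2 (visual -> semantic): the semantic token attends over the updated
    visual tokens and adds the result as a residual.
    """
    scale = math.sqrt(len(semantic_token))
    updated = []
    for v in visual_tokens:
        gate = 1.0 / (1.0 + math.exp(-dot(v, semantic_token) / scale))
        updated.append([vi + gate * si for vi, si in zip(v, semantic_token)])
    attended = attend(semantic_token, updated, updated)
    return [si + ai for si, ai in zip(semantic_token, attended)]
```

In this reading, the returned vector would play the role of the reinforced semantic token used as the identity representation. A real module would almost certainly add learned query, key, and value projections and normalization.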

Load-bearing premise

A single generated semantic token, after refinement, will reliably encode sufficient discriminative identity information for cross-camera matching without the generative objective interfering.

What would settle it

Observing that ReID accuracy falls below standard visual baselines when the SGI module is ablated or when the semantic token is replaced with a purely visual feature.

Figures

Figures reproduced from arXiv: 2411.18111 by Bin Li, Qizao Wang, Xiangyang Xue.

**Figure 1.** Comparison of different person ReID frameworks. (a) Conventionally, a visual encoder is applied to extract pedestrian identity representations, overlooking the supplemented semantics from other modalities. (b) CLIP-ReID uses the text encoder of CLIP to introduce text semantics based on the contrastive learning paradigm. (c) Our proposed LVLM-ReID incorporates LVLM in the ReID pipeline. Through instruction…

**Figure 2.** Framework of our LVLM-ReID. It leverages clear instructions to guide the frozen LLM towards focusing on particular visual semantics within pedestrian images, resulting in the generation of one semantic token that encapsulates the pedestrian’s appearance information. Subsequently, an efficient interaction module is designed to facilitate refinement between the generated token and the visual tokens. Finally,…

**Figure 3.** Visualization of attention maps. We show (a) the original images, and compare the attentions of (b) the “Ours w/o PSTG” variant, and (c) our LVLM-ReID model, on CUHK03. Accompanying text from Section 4.4 (Qualitative Analysis): To understand the identity-related information in the semantic token and demonstrate the effectiveness of LVLM in enriching pedestrian semantics, we analyze the attention maps using [3] in…

Original abstract

Large Vision-Language Models (LVLMs) that incorporate visual models and large language models have achieved impressive results across cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the representation of pedestrian identity. Our framework integrates the semantic understanding and generation capabilities of LVLM into end-to-end ReID training, allowing LVLM to capture rich semantic cues during both training and inference. LVLM-ReID achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LVLM-ReID generates one instruction-guided semantic token refined by SGI to serve as a ReID embedding, which is a fresh integration but leaves the token's cross-camera stability unproven.

The paper's core move is to take an LVLM, prompt it to output exactly one semantic token from a person image, run that token through a Semantic-Guided Interaction module that lets it exchange information with the visual tokens, and then use the result as the sole identity descriptor for ReID matching. This setup is presented as a way to fold LVLM semantic generation into end-to-end ReID training without extra image-text pairs.

The specific token-plus-SGI construction looks new relative to the ReID and LVLM work the authors cite, and the claim of competitive benchmark numbers without added annotations is a practical point in its favor. The framing of the generative-versus-discriminative tension is also clear and honest.

The soft spot is exactly the one the stress-test flags: nothing in the described architecture forces the single reinforced token to extract camera-invariant identity cues rather than prompt-dependent or view-specific semantics. The generative next-token objective does not naturally produce metric-stable features, and the abstract supplies no auxiliary losses, ablations, or analysis showing that the token behaves like a standard ReID descriptor. Without those details the central claim rests on implementation choices that are not visible here.

This is aimed at ReID researchers who want to test whether LVLM semantics can be repurposed for retrieval. A reader already working on multimodal ReID or LVLM adaptation would get concrete ideas to try. I would send it to peer review because the integration is distinct enough that the experiments deserve a full check, even if the token-stability question will need stronger evidence.
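The token-stability question raised in the letter can be probed with a simple diagnostic. The sketch below, in plain Python, compares the mean cosine similarity of same-identity pairs seen by different cameras against the mean similarity of different-identity pairs. The function name `stability_gap` and the `(embedding, person_id, camera_id)` sample layout are assumptions made for illustration; nothing here comes from the paper.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def stability_gap(samples):
    """Hypothetical diagnostic over (embedding, person_id, camera_id) tuples.

    Returns (mean similarity of same-ID cross-camera pairs,
             mean similarity of different-ID pairs).
    Same-ID pairs from the same camera are ignored on purpose, since they
    say nothing about camera invariance.
    """
    pos, neg = [], []
    for (ea, pa, ca), (eb, pb, cb) in combinations(samples, 2):
        sim = cosine(ea, eb)
        if pa == pb and ca != cb:
            pos.append(sim)
        elif pa != pb:
            neg.append(sim)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(pos), mean(neg)
```

If the refined token were dominated by view-specific semantics, one would expect the first value to sit close to the second; a clear margin between them would be weak evidence in the other direction.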

Referee Report

2 major / 0 minor

Summary. The manuscript proposes LVLM-ReID, a framework that uses instructions to guide an LVLM in generating a single semantic token from a person image, refines this token via a Semantic-Guided Interaction (SGI) module that creates reciprocal interaction with visual tokens, and employs the reinforced token as the sole pedestrian identity representation for ReID. The approach integrates LVLM semantic understanding and generation into end-to-end ReID training and reports competitive results on multiple benchmarks without requiring additional image-text annotations.

Significance. If the empirical results hold under scrutiny, the work offers a concrete demonstration of adapting generative LVLMs to a discriminative metric-learning task without auxiliary annotations, which could stimulate further exploration of language-guided semantics in ReID. The end-to-end integration and single-token design are distinctive, though their effectiveness hinges on unverified stability properties.

major comments (2)

[Abstract] The central claim that the single reinforced semantic token (after SGI) serves as a reliable identity representation for cross-camera matching rests on the unstated assumption that next-token prediction can be repurposed for metric stability. No auxiliary loss, architectural constraint, or training objective is described that would prevent the token from encoding prompt-dependent or view-specific attributes instead of camera-invariant identity cues.
[Abstract] The reported competitive results on benchmarks are presented without reference to implementation details of how the semantic token is extracted at inference, how similarity is computed, or any ablation isolating the contribution of SGI versus the base LVLM forward pass, making it impossible to assess whether the performance reduces to standard ReID backbones.
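For reference, the kind of inference-time protocol the second comment asks about is usually a nearest-neighbour search over embeddings, scored with Rank-1 accuracy and mean average precision. The sketch below assumes cosine similarity and the common convention of skipping gallery images that share both identity and camera with the query; whether LVLM-ReID uses exactly this setup is not stated in the abstract. The name `evaluate` and the tuple layout are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def evaluate(queries, gallery):
    """Rank-1 and mAP over (embedding, person_id, camera_id) tuples.

    Assumed protocol: gallery items with the same person_id AND camera_id
    as the query are excluded; queries with no valid match are skipped.
    """
    hits, aps = 0, []
    for q_emb, q_pid, q_cam in queries:
        ranked = sorted(
            ((cosine(q_emb, g_emb), g_pid)
             for g_emb, g_pid, g_cam in gallery
             if not (g_pid == q_pid and g_cam == q_cam)),
            key=lambda t: t[0],
            reverse=True,
        )
        matches = [pid == q_pid for _, pid in ranked]
        if not any(matches):
            continue
        hits += int(matches[0])
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        aps.append(sum(precisions) / len(precisions))
    if not aps:
        return 0.0, 0.0
    return hits / len(aps), sum(aps) / len(aps)
```

Running the same function on embeddings from a variant without SGI, or from a purely visual feature, would give the ablation comparison the report says is missing.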

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces LVLM-ReID as a new end-to-end framework that uses instruction-guided LVLM generation of a single semantic token, refined via the SGI module, to produce ReID embeddings. This construction and the reported competitive benchmark results are presented as empirical outcomes of the proposed architecture and training procedure. No load-bearing step reduces by definition, by fitted-parameter renaming, or by self-citation chain to its own inputs; the derivation chain remains independent of the final performance numbers and relies on external benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the framework description implies standard LVLM components plus one new module whose internal mechanics are not detailed.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
cs.CV 2026-04 unverdicted novelty 7.0

STFER uses LVLM-generated identity-consistent semantic text to drive visual token filtering and expert routing for improved any-time person re-identification under clothing changes and modality shifts.

Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation
cs.CV 2026-04 unverdicted novelty 5.0

A multi-view semantic reformulation and feature compensation method using LLMs and VLMs improves text-to-image person retrieval accuracy without training and reaches SOTA on three datasets.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 2 Pith papers · 8 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,

[3]

Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers

Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 397–406,

[4]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representat...

[5]

TransReID: Transformer-based object re-identification

Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. TransReID: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15013–15022, 2021.

[6]

In Defense of the Triplet Loss for Person Re-Identification

Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

[7]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,

[8]

Semantics-aligned representation learning for person re-identification

Xin Jin, Cuiling Lan, Wenjun Zeng, Guoqiang Wei, and Zhibo Chen. Semantics-aligned representation learning for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11173–11180, 2020.

[9]

Combined depth space based architecture search for person re-identification

Hanjun Li, Gaojie Wu, and Wei-Shi Zheng. Combined depth space based architecture search for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6729–6738, 2021.

[10]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.

[11]

CLIP-ReID: exploiting vision-language model for image re-identification without concrete text labels

Siyuan Li, Li Sun, and Qingli Li. CLIP-ReID: exploiting vision-language model for image re-identification without concrete text labels. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1405–1413, 2023.

[12]

DeepReID: Deep filter pairing neural network for person re-identification

Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 152–159, 2014.

[13]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

[14]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.

[15]

A strong baseline and batch normalization neck for deep person re-identification

Hao Luo, Wei Jiang, Youzhi Gu, Fuxu Liu, Xingyu Liao, Shenqi Lai, and Jianyang Gu. A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia, 22(10):2597–2609, 2019.

[16]

Relation network for person re-identification

Hyunjong Park and Bumsub Ham. Relation network for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11839–11847, 2020.

[17]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[18]

Counterfactual attention learning for fine-grained visual categorization and re-identification

Yongming Rao, Guangyi Chen, Jiwen Lu, and Jie Zhou. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1025–1034, 2021.

[19]

Performance measures and a data set for multi-target, multi-camera tracking

Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision Workshops, pages 17–35, 2016.

[20]

Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)

Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision, pages 480–496, 2018.

[21]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[22]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[23]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[24]

Learning discriminative features with multiple granularities for person re-identification

Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, pages 274–282, 2018.

[25]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[26]

Rethinking person re-identification from a projection-on-prototypes perspective

Qizao Wang, Xuelin Qian, Bin Li, Yanwei Fu, and Xiangyang Xue. Rethinking person re-identification from a projection-on-prototypes perspective. arXiv preprint arXiv:2308.10717, 2023.

[27]

Pose-guided feature disentangling for occluded person re-identification based on transformer

Tao Wang, Hong Liu, Pinhao Song, Tianyu Guo, and Wei Shi. Pose-guided feature disentangling for occluded person re-identification based on transformer. In Proceedings of the AAAI conference on artificial intelligence, pages 2540–2549,

[28]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

[29]

Deep learning for person re-identification: A survey and outlook

Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):2872–2893, 2021.

[30]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

[31]

In defense of the classification loss for person re-identification

Yao Zhai, Xun Guo, Yan Lu, and Houqiang Li. In defense of the classification loss for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.

[32]

Relation-aware global attention for person re-identification

Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3186–3195, 2020.

[33]

Pyramidal person re-identification via multi-loss dynamic training

Feng Zheng, Cheng Deng, Xing Sun, Xinyang Jiang, Xiaowei Guo, Zongqiao Yu, Feiyue Huang, and Rongrong Ji. Pyramidal person re-identification via multi-loss dynamic training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8514–8522, 2019.

[34]

Scalable person re-identification: A benchmark

Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1116–1124,

[35]

A discriminatively learned CNN embedding for person reidentification

Zhedong Zheng, Liang Zheng, and Yi Yang. A discriminatively learned CNN embedding for person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications, 14(1):1–20, 2017.

[36]

Joint discriminative and generative learning for person re-identification

Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2138–2147, 2019.

[37]

Re-ranking person re-identification with k-reciprocal encoding

Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3652–3661, 2017.

[38]

Random erasing data augmentation

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13001–13008, 2020.

[39]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

[40]

Dual cross-attention learning for fine-grained visual categorization and object re-identification

Haowei Zhu, Wenjing Ke, Dong Li, Ji Liu, Lu Tian, and Yi Shan. Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4692–4702, 2022.

[41]

AAformer: Auto-aligned transformer for person re-identification

Kuan Zhu, Haiyun Guo, Shiliang Zhang, Yaowei Wang, Jing Liu, Jinqiao Wang, and Ming Tang. AAformer: Auto-aligned transformer for person re-identification. IEEE Transactions on Neural Networks and Learning Systems, 2023.

work page 2023

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,

work page

[3] [3]

Generic attention- model explainability for interpreting bi-modal and encoder- decoder transformers

Hila Chefer, Shir Gur, and Lior Wolf. Generic attention- model explainability for interpreting bi-modal and encoder- decoder transformers. In Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 397–406,

work page

[4] [4]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representat...

work page 2021

[5] [5]

Transreid: Transformer-based object re- identification

Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. Transreid: Transformer-based object re- identification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15013–15022, 2021. 2, 6, 7

work page 2021

[6] [6]

In Defense of the Triplet Loss for Person Re-Identification

Alexander Hermans, Lucas Beyer, and Bastian Leibe. In de- fense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017. 4

work page internal anchor Pith review Pith/arXiv arXiv 2017

[7] [7]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,

work page

[8] [8]

Semantics-aligned representation learning for person re-identification

Xin Jin, Cuiling Lan, Wenjun Zeng, Guoqiang Wei, and Zhibo Chen. Semantics-aligned representation learning for person re-identification. In Proceedings of the AAAI Confer- ence on Artificial Intelligence, pages 11173–11180, 2020. 1, 2, 6

work page 2020

[9] [9]

Combined depth space based architecture search for person re-identification

Hanjun Li, Gaojie Wu, and Wei-Shi Zheng. Combined depth space based architecture search for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6729–6738, 2021. 1, 2, 6

work page 2021

[10] [10]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In In- ternational conference on machine learning , pages 19730– 19742. PMLR, 2023. 2, 3

work page 2023

[11] [11]

Clip-reid: exploiting vision-language model for image re-identification without concrete text labels

Siyuan Li, Li Sun, and Qingli Li. Clip-reid: exploiting vision-language model for image re-identification without concrete text labels. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1405–1413, 2023. 1, 2, 5, 6

work page 2023

[12] [12]

Deep- reid: Deep filter pairing neural network for person re- identification

Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deep- reid: Deep filter pairing neural network for person re- identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 152– 159, 2014. 5

work page 2014

[13] [13]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 2

work page 2024

[14] [14]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 2, 3

work page 2024

[15] [15]

A strong baseline and batch normalization neck for deep person re-identification

Hao Luo, Wei Jiang, Youzhi Gu, Fuxu Liu, Xingyu Liao, Shenqi Lai, and Jianyang Gu. A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia, 22(10):2597–2609, 2019. 1, 4, 5

work page 2019

[16] [16]

Relation network for per- son re-identification

Hyunjong Park and Bumsub Ham. Relation network for per- son re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11839–11847, 2020. 1, 2, 6

work page 2020

[17] [17]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2

work page 2021

[18] [18]

Counterfactual attention learning for fine-grained visual cat- egorization and re-identification

Yongming Rao, Guangyi Chen, Jiwen Lu, and Jie Zhou. Counterfactual attention learning for fine-grained visual cat- egorization and re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 1025–1034, 2021. 1, 2, 6

work page 2021

[19] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision Workshops, pages 17–35, 2016. 2, 5

[20] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision, pages 480–496, 2018. 2

[21] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

[22] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 2

[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. 2, 4, 6, 7

[24] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, pages 274–282, 2018. 2, 6

[25] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 3, 4, 5

[26] Qizao Wang, Xuelin Qian, Bin Li, Yanwei Fu, and Xiangyang Xue. Rethinking person re-identification from a projection-on-prototypes perspective. arXiv preprint arXiv:2308.10717, 2023. 1

[27] Tao Wang, Hong Liu, Pinhao Song, Tianyu Guo, and Wei Shi. Pose-guided feature disentangling for occluded person re-identification based on transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2540–2549,

[28] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 2, 3

[29] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):2872–2893, 2021. 1, 2

[30] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 3

[31] Yao Zhai, Xun Guo, Yan Lu, and Houqiang Li. In defense of the classification loss for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019. 1

[32] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3186–3195, 2020. 2, 6

[33] Feng Zheng, Cheng Deng, Xing Sun, Xinyang Jiang, Xiaowei Guo, Zongqiao Yu, Feiyue Huang, and Rongrong Ji. Pyramidal person re-identification via multi-loss dynamic training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8514–8522, 2019. 2, 6

[34] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1116–1124,

[35] Zhedong Zheng, Liang Zheng, and Yi Yang. A discriminatively learned CNN embedding for person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications, 14(1):1–20, 2017. 1

[36] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2138–2147, 2019. 2, 6

[37] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3652–3661, 2017. 5

[38] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13001–13008, 2020. 5

[39] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 3

[40] Haowei Zhu, Wenjing Ke, Dong Li, Ji Liu, Lu Tian, and Yi Shan. Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4692–4702, 2022. 2, 6

[41] Kuan Zhu, Haiyun Guo, Shiliang Zhang, Yaowei Wang, Jing Liu, Jinqiao Wang, and Ming Tang. Aaformer: Auto-aligned transformer for person re-identification. IEEE Transactions on Neural Networks and Learning Systems, 2023. 2, 6