Recognition: unknown
Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
Pith reviewed 2026-05-10 11:57 UTC · model grok-4.3
The pith
Semantic text from vision-language models filters tokens and routes experts to handle clothing and modality changes in person re-identification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guiding large vision-language models with instructions to produce identity-intrinsic semantic text that captures biometric constants allows this text to power Semantic-driven Visual Token Filtering (SVTF), which strengthens informative visual areas and reduces background noise, and Semantic-driven Expert Routing (SER), which incorporates the text for more stable multi-scenario gating, yielding stronger retrieval under clothing variations and RGB-IR shifts.
What carries the argument
Identity consistency text generated by large vision-language models, applied through Semantic-driven Visual Token Filtering (SVTF) to select visual regions and Semantic-driven Expert Routing (SER) to adapt expert selection.
If this is right
- State-of-the-art results on the Any-Time ReID dataset (AT-USTC) for arbitrary clothing and modality conditions.
- Competitive or superior performance on five standard person re-identification benchmarks after training only on AT-USTC.
- Explicit robustness gains against both short-term and long-term clothing changes.
- More reliable cross-modality matching between RGB daytime and IR nighttime images.
Where Pith is reading between the lines
- Linguistic identity cues could reduce reliance on visual appearance in other retrieval or tracking tasks that face appearance variation.
- The same text-driven filtering and routing pattern might allow models to adapt to new conditions without full retraining.
- Combining the approach with additional language sources could further stabilize the identity text across diverse inputs.
Load-bearing premise
Large vision-language models guided by instructions will reliably output semantic text that reflects unchanging biometric identity features instead of variable clothing or lighting details.
What would settle it
Controlled tests on the AT-USTC dataset where the generated semantic text shows no accuracy gain on clothing-change or RGB-IR cases, or where disabling SVTF and SER leaves performance unchanged.
Figures
read the original abstract
Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods are highly relying on pure visual features, which are prone to change due to environmental and time factors, resulting in significantly performance deterioration under scenarios involving illumination caused modality shifts or cloth-change. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity consistency text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants for the semantic model driven. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated across 5 widely-used ReID benchmarks demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Semantic-driven Token Filtering and Expert Routing (STFER) for Any-Time Person Re-identification (AT-ReID). It uses Large Vision-Language Models (LVLMs) guided by instructions to generate identity-intrinsic semantic text capturing biometric constants. This text drives Semantic-driven Visual Token Filtering (SVTF) to enhance informative visual regions while suppressing background noise, and Semantic-driven Expert Routing (SER) to integrate semantics into multi-scenario gating. The framework is claimed to yield features robust to clothing changes and RGB-IR modality shifts, with state-of-the-art results on the AT-USTC dataset and competitive generalization on five standard ReID benchmarks.
Significance. If the LVLM-generated semantic text reliably supplies clothing- and modality-invariant identity cues that improve upon pure visual baselines, the work could meaningfully advance AT-ReID by demonstrating a practical way to combine semantic guidance with token-level filtering and expert routing. The novelty lies in the specific SVTF and SER modules and the introduction of the AT-USTC dataset; successful validation would provide a template for leveraging LVLMs in other vision tasks requiring invariance to appearance changes.
major comments (3)
- Abstract: The central claim that the generated identity consistency text provides features 'robust to both clothing variations and cross-modality shifts' is load-bearing for the entire contribution, yet no consistency metric, example generations across clothing or RGB/IR pairs, or ablation isolating the semantic component from visual baselines is referenced, leaving the invariance property as an untested premise rather than a demonstrated result.
- Abstract: The assertion of 'state-of-the-art results' on AT-USTC and 'superior generalization capabilities' on five benchmarks is made without any quantitative numbers, tables, error bars, or baseline comparisons, which prevents assessment of whether the SVTF and SER modules deliver the claimed improvements.
- Abstract: The description of how instructions guide the LVLM to suppress clothing and modality-specific cues while preserving biometric constants lacks any implementation details, prompt examples, or verification procedure, making it impossible to evaluate whether the semantic text actually functions as intended.
minor comments (2)
- Abstract: The sentence 'resulting in significantly performance deterioration' contains a grammatical error and should read 'resulting in significant performance deterioration'.
- Abstract: The relationship between the text token, SVTF, and SER is described at a high level; a brief sentence clarifying the information flow between the two modules would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that the abstract would benefit from additional specificity and will revise it to reference key results, metrics, and implementation details from the full manuscript while preserving its concise nature.
read point-by-point responses
-
Referee: Abstract: The central claim that the generated identity consistency text provides features 'robust to both clothing variations and cross-modality shifts' is load-bearing for the entire contribution, yet no consistency metric, example generations across clothing or RGB/IR pairs, or ablation isolating the semantic component from visual baselines is referenced, leaving the invariance property as an untested premise rather than a demonstrated result.
Authors: The abstract summarizes the core idea, but the full manuscript provides supporting evidence: Section 4.3 contains ablations that isolate the contribution of the semantic text (comparing against pure visual baselines), Figure 3 shows example LVLM-generated identity-consistent texts across clothing changes and RGB-IR pairs, and we introduce a semantic consistency metric based on embedding similarity. We will revise the abstract to briefly reference these elements and note the observed robustness. revision: yes
-
Referee: Abstract: The assertion of 'state-of-the-art results' on AT-USTC and 'superior generalization capabilities' on five benchmarks is made without any quantitative numbers, tables, error bars, or baseline comparisons, which prevents assessment of whether the SVTF and SER modules deliver the claimed improvements.
Authors: We acknowledge that the abstract lacks specific numbers. The full paper includes Table 1 (AT-USTC results with Rank-1/mAP and comparisons to recent AT-ReID methods) and Table 2 (generalization on five standard benchmarks with error bars from multiple runs). We will update the abstract to incorporate key quantitative highlights, such as the reported Rank-1 improvement on AT-USTC and average gains on the other datasets. revision: yes
-
Referee: Abstract: The description of how instructions guide the LVLM to suppress clothing and modality-specific cues while preserving biometric constants lacks any implementation details, prompt examples, or verification procedure, making it impossible to evaluate whether the semantic text actually functions as intended.
Authors: The abstract is space-constrained, but Section 3.1 details the instruction design (prompts that emphasize biometric attributes like body structure and gait while instructing the LVLM to ignore clothing and illumination), with full prompt templates provided in the supplementary material. Verification occurs via qualitative examples and quantitative checks in Section 4.1. We will revise the abstract to include a short description of the guidance strategy and a pointer to the prompts. revision: yes
Circularity Check
No significant circularity; derivation relies on external LVLM capabilities and introduces independent modules
full rationale
The paper's central framework (STFER) is defined by proposing two new modules (SVTF and SER) that consume text tokens generated by an external LVLM guided by instructions. No equations or derivations reduce a claimed prediction or result back to a fitted parameter or self-referential definition within the paper. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The approach depends on the (external) assumption that LVLMs can produce identity-intrinsic text, but this is not a circular reduction by construction; it is an unverified premise about an outside model. The derivation chain remains self-contained against external benchmarks and does not rename known results or smuggle in prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LVLMs can be guided by instructions to generate identity-intrinsic semantic text that captures biometric constants robust to clothing and modality changes
invented entities (2)
-
Semantic-driven Visual Token Filtering (SVTF)
no independent evidence
-
Semantic-driven Expert Routing (SER)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Towards anytime retrieval: A benchmark for anytime person re-identification,
X. Li, Y . Lu, B. Liu, J. Li, Q. Yang, T. Gong, Q. Chu, M. Ye, and N. Yu, “Towards anytime retrieval: A benchmark for anytime person re-identification,”arXiv preprint arXiv:2509.16635, 2025
-
[2]
Scalable person re-identification: A benchmark,
L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” inProceedings of the IEEE international conference on computer vision, pp. 1116–1124, 2015
2015
-
[3]
Rgb-infrared cross-modality person re-identification,
A. Wu, W.-S. Zheng, H.-X. Yu, S. Gong, and J. Lai, “Rgb-infrared cross-modality person re-identification,” inProceedings of the IEEE international conference on computer vision, pp. 5380–5389, 2017
2017
-
[4]
Person re-identification by contour sketch under moderate clothing change,
Q. Yang, A. Wu, and W.-S. Zheng, “Person re-identification by contour sketch under moderate clothing change,”IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 6, pp. 2029–2046, 2019
2029
-
[5]
Attribute-aware attention model for fine-grained representation learning,
K. Han, J. Guo, C. Zhang, and M. Zhu, “Attribute-aware attention model for fine-grained representation learning,” inProceedings of the 26th ACM international conference on Multimedia, pp. 2040–2048, 2018
2040
-
[6]
Improving person re-identification by attribute and identity learning,
Y . Lin, L. Zheng, Z. Zheng, Y . Wu, Z. Hu, C. Yan, and Y . Yang, “Improving person re-identification by attribute and identity learning,”Pattern recognition, vol. 95, pp. 151–161, 2019
2019
-
[7]
Cerberus: Attribute-based person re-identification using semantic ids,
C. Eom, G. Lee, K. Cho, H. Jung, M. Jin, and B. Ham, “Cerberus: Attribute-based person re-identification using semantic ids,”Expert Systems with Applications, vol. 259, p. 125320, 2025
2025
-
[8]
Dma: Dual modality-aware alignment for visible-infrared person re- identification,
Z. Cui, J. Zhou, and Y . Peng, “Dma: Dual modality-aware alignment for visible-infrared person re- identification,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 2696–2708, 2024
2024
-
[9]
Weakly supervised visible-infrared person re-identification via heterogeneous expert collaborative consistency learning,
Y . Zhang, L. Kong, H. Li, and J. Wen, “Weakly supervised visible-infrared person re-identification via heterogeneous expert collaborative consistency learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12659–12669, 2025
2025
-
[10]
Dual level adaptive weighting for cloth-changing person re-identification,
F. Liu, M. Ye, and B. Du, “Dual level adaptive weighting for cloth-changing person re-identification,”IEEE Transactions on Image Processing, vol. 32, pp. 5075–5086, 2023
2023
-
[11]
S. Huang, Y . Zhou, R. Prabhakar, X. Liu, Y . Guo, H. Yi, C. Peng, R. Chellappa, and C. P. Lau, “Self- supervised learning of whole and component-based semantic representations for person re-identification,” arXiv preprint arXiv:2311.17074, vol. 4, 2023
-
[12]
Geff: Improving any clothes-changing person reid model using gallery enrichment with face features,
D. Arkushin, B. Cohen, S. Peleg, and O. Fried, “Geff: Improving any clothes-changing person reid model using gallery enrichment with face features,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 152–162, 2024
2024
-
[13]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017
2017
-
[14]
Transreid: Transformer-based object re- identification,
S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, “Transreid: Transformer-based object re- identification,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 15013– 15022, 2021
2021
-
[15]
Personvit: large-scale self-supervised vision transformer for person re- identification,
B. Hu, X. Wang, and W. Liu, “Personvit: large-scale self-supervised vision transformer for person re- identification,”Machine Vision and Applications, vol. 36, no. 2, p. 32, 2025
2025
-
[16]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Min- derer, G. Heigold, S. Gelly,et al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[17]
Circle loss: A unified perspective of pair similarity optimization,
Y . Sun, C. Cheng, Y . Zhang, C. Zhang, L. Zheng, Z. Wang, and Y . Wei, “Circle loss: A unified perspective of pair similarity optimization,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6398–6407, 2020
2020
-
[18]
Condense loss: Exploiting vector magnitude during person re-identification training process,
X. Yang, W. Dong, Y . Tang, G. Zheng, N. Wang, and X. Gao, “Condense loss: Exploiting vector magnitude during person re-identification training process,”Pattern Recognition, p. 112443, 2025
2025
-
[19]
Shape-erased feature learning for visible-infrared person re- identification,
J. Feng, A. Wu, and W.-S. Zheng, “Shape-erased feature learning for visible-infrared person re- identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22752–22761, 2023. 10
2023
-
[20]
Towards grand unified representation learning for unsupervised visible- infrared person re-identification,
B. Yang, J. Chen, and M. Ye, “Towards grand unified representation learning for unsupervised visible- infrared person re-identification,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11069–11079, 2023
2023
-
[21]
Visible-infrared person re-identification via semantic alignment and affinity inference,
X. Fang, Y . Yang, and Y . Fu, “Visible-infrared person re-identification via semantic alignment and affinity inference,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11270– 11279, 2023
2023
-
[22]
Counterfactual intervention feature transfer for visible-infrared person re-identification,
X. Li, Y . Lu, B. Liu, Y . Liu, G. Yin, Q. Chu, J. Huang, F. Zhu, R. Zhao, and N. Yu, “Counterfactual intervention feature transfer for visible-infrared person re-identification,” inEuropean conference on computer vision, pp. 381–398, Springer, 2022
2022
-
[23]
Dynamic dual-attentive aggregation learning for visible- infrared person re-identification,
M. Ye, J. Shen, D. J. Crandall, L. Shao, and J. Luo, “Dynamic dual-attentive aggregation learning for visible- infrared person re-identification,” inEuropean conference on computer vision, pp. 229–247, Springer, 2020
2020
-
[24]
Ugg-reid: Uncertainty-guided graph model for multi-modal object re-identification,
X. Wan, A. Zheng, B. Jiang, B. Wang, C. Li, and J. Tang, “Ugg-reid: Uncertainty-guided graph model for multi-modal object re-identification,”arXiv preprint arXiv:2507.04638, 2025
-
[25]
Y . Feng, J. Li, J. Hu, Y . Zhang, L. Tan, and J. Ji, “Mdreid: Modality-decoupled learning for any-to-any multi-modal object re-identification,”arXiv preprint arXiv:2510.23301, 2025
-
[26]
Clothes-changing person re-identification with rgb modality only,
X. Gu, H. Chang, B. Ma, S. Bai, S. Shan, and X. Chen, “Clothes-changing person re-identification with rgb modality only,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1060–1069, 2022
2022
-
[27]
Clothing-change feature augmentation for person re- identification,
K. Han, S. Gong, Y . Huang, L. Wang, and T. Tan, “Clothing-change feature augmentation for person re- identification,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22066–22075, 2023
2023
-
[28]
Vision transformer-based robust learning for cloth-changing person re-identification,
C. Xue, Z. Deng, W. Yang, E. Hu, Y . Zhang, S. Wang, and Y . Wang, “Vision transformer-based robust learning for cloth-changing person re-identification,”Applied Soft Computing, vol. 163, p. 111891, 2024
2024
-
[29]
Identity-guided collaborative learning for cloth-changing person reidentification,
Z. Gao, S. Wei, W. Guan, L. Zhu, M. Wang, and S. Chen, “Identity-guided collaborative learning for cloth-changing person reidentification,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 2819–2837, 2023
2023
-
[30]
Semantic-aware consistency network for cloth-changing person re-identification,
P. Guo, H. Liu, J. Wu, G. Wang, and T. Wang, “Semantic-aware consistency network for cloth-changing person re-identification,” inProceedings of the 31st ACM international conference on multimedia, pp. 8730– 8739, 2023
2023
-
[31]
Learning 3d shape feature for texture-insensitive person re-identification,
J. Chen, X. Jiang, F. Wang, J. Zhang, F. Zheng, X. Sun, and W.-S. Zheng, “Learning 3d shape feature for texture-insensitive person re-identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8146–8155, 2021
2021
-
[32]
Cloth-changing person re-identification from a single image with gait prediction and regularization,
X. Jin, T. He, K. Zheng, Z. Yin, X. Shen, Z. Huang, R. Feng, J. Huang, Z. Chen, and X.-S. Hua, “Cloth-changing person re-identification from a single image with gait prediction and regularization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14278–14287, 2022
2022
-
[33]
All in one framework for multimodal re-identification in the wild,
H. Li, M. Ye, M. Zhang, and B. Du, “All in one framework for multimodal re-identification in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17459–17469, 2024
2024
-
[34]
Towards modality-agnostic person re-identification with descriptive query,
C. Chen, M. Ye, and D. Jiang, “Towards modality-agnostic person re-identification with descriptive query,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15128–15137, 2023
2023
-
[35]
Unihcp: A unified model for human-centric perceptions,
Y . Ci, Y . Wang, M. Chen, S. Tang, L. Bai, F. Zhu, R. Zhao, F. Yu, D. Qi, and W. Ouyang, “Unihcp: A unified model for human-centric perceptions,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 17840–17852, 2023
2023
-
[36]
Retrieve anyone: A general-purpose person re-identification task with instructions,
W. He, S. Tang, Y . Deng, Q. Chen, Q. Xie, Y . Wang, L. Bai, F. Zhu, R. Zhao, W. Ouyang,et al., “Retrieve anyone: A general-purpose person re-identification task with instructions,”arXiv preprint arXiv:2306.07520, vol. 3, 2023
-
[37]
Reid5o: Achieving omni multi-modal person re-identification in a single model,
J. Zuo, Y . Deng, M. Tan, R. Jin, D. Wu, N. Sang, L. Pan, and C. Gao, “Reid5o: Achieving omni multi-modal person re-identification in a single model,”arXiv preprint arXiv:2506.09385, 2025. 11
-
[38]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat,et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang,et al., “Qwen technical report,”arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar,et al., “Llama: Open and efficient foundation language models. arxiv 2023,”arXiv preprint arXiv:2302.13971, vol. 10, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-m. Yin, S. Bai, X. Xu, Y . Chen,et al., “Qwen-image technical report,”arXiv preprint arXiv:2508.02324, 2025
work page internal anchor Pith review arXiv 2025
-
[42]
Visionllm: Large language model is also an open-ended decoder for vision-centric tasks,
W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu, J. Zhou, Y . Qiao,et al., “Visionllm: Large language model is also an open-ended decoder for vision-centric tasks,”Advances in Neural Information Processing Systems, vol. 36, pp. 61501–61513, 2023
2023
-
[43]
Visual instruction tuning,
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”Advances in neural information processing systems, vol. 36, pp. 34892–34916, 2023
2023
-
[44]
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford,et al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Gemini: A Family of Highly Capable Multimodal Models
G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Mil- lican,et al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Vip: Versatile image outpainting empowered by multimodal large language model,
J. Yang, H. Wang, Z. Zhu, C. Liu, M. Wu, and M. Sun, “Vip: Versatile image outpainting empowered by multimodal large language model,” inProceedings of the Asian Conference on Computer Vision, pp. 1082–1099, 2024
2024
-
[47]
Instagen: Enhancing object detection by training on synthetic dataset,
C. Feng, Y . Zhong, Z. Jie, W. Xie, and L. Ma, “Instagen: Enhancing object detection by training on synthetic dataset,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14121–14130, 2024
2024
-
[48]
Taming self-training for open-vocabulary object detection,
S. Zhao, S. Schulter, L. Zhao, Z. Zhang, Y . Suh, M. Chandraker, D. N. Metaxas,et al., “Taming self-training for open-vocabulary object detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13938–13947, 2024
2024
-
[49]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,”Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020
2020
-
[50]
Florence-2: Advancing a unified representation for a variety of vision tasks,
B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y . Lu, M. Zeng, C. Liu, and L. Yuan, “Florence-2: Advancing a unified representation for a variety of vision tasks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4818–4829, 2024
2024
-
[51]
Gsva: Generalized segmentation via multimodal large language models,
Z. Xia, D. Han, Y . Han, X. Pan, S. Song, and G. Huang, “Gsva: Generalized segmentation via multimodal large language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3858–3869, 2024
2024
-
[52]
When large vision-language models meet person re-identification,
Q. Wang, B. Li, and X. Xue, “When large vision-language models meet person re-identification,”arXiv preprint arXiv:2411.18111, 2024
work page internal anchor Pith review arXiv 2024
-
[53]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge,et al., “Qwen2-vl: Enhanc- ing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Chatreid: Open-ended interactive person retrieval via hierarchical progressive tuning for vision language models,
K. Niu, H. Yu, M. Zhao, T. Fu, S. Yi, W. Lu, B. Li, X. Qian, and X. Xue, “Chatreid: Open-ended interactive person retrieval via hierarchical progressive tuning for vision language models,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 24245–24254, 2025
2025
-
[55]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016
2016
-
[56]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge,et al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Deepreid: Deep filter pairing neural network for person re- identification,
W. Li, R. Zhao, T. Xiao, and X. Wang, “Deepreid: Deep filter pairing neural network for person re- identification,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 152–159, 2014. 12
2014
-
[58]
Long-term cloth-changing person re-identification,
X. Qian, W. Wang, L. Zhang, F. Zhu, Y . Fu, T. Xiang, Y .-G. Jiang, and X. Xue, “Long-term cloth-changing person re-identification,” inProceedings of the Asian conference on computer vision, 2020
2020
-
[59]
Bag of tricks and a strong baseline for deep person re-identification,
H. Luo, Y . Gu, X. Liao, S. Lai, and W. Jiang, “Bag of tricks and a strong baseline for deep person re-identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 0–0, 2019
2019
-
[60]
Clip-reid: exploiting vision-language model for image re-identification without concrete text labels,
S. Li, L. Sun, and Q. Li, “Clip-reid: exploiting vision-language model for image re-identification without concrete text labels,” inProceedings of the AAAI conference on artificial intelligence, vol. 37, pp. 1405– 1413, 2023
2023
-
[61]
Good is bad: Causality inspired cloth-debiasing for cloth-changing person re-identification,
Z. Yang, M. Lin, X. Zhong, Y . Wu, and Z. Wang, “Good is bad: Causality inspired cloth-debiasing for cloth-changing person re-identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1472–1481, 2023
2023
-
[62]
X. Li, Y . Lu, B. Liu, Y . Hou, Y . Liu, Q. Chu, W. Ouyang, and N. Yu, “Clothes-invariant feature learning by causal intervention for clothes-changing person re-identification,”arXiv preprint arXiv:2305.06145, 2023
-
[63]
Channel augmented joint learning for visible-infrared recognition,
M. Ye, W. Ruan, B. Du, and M. Z. Shou, “Channel augmented joint learning for visible-infrared recognition,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 13567–13576, 2021
2021
-
[64]
Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification,
Y . Zhang and H. Wang, “Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2153–2162, 2023. 13
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.