InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification

Aryan Shukla; Eric Granger; Maguelonne Heritier; Rajarshi Bhattacharya; Shakeeb Murtaza

arxiv: 2604.27122 · v1 · submitted 2026-04-29 · 💻 cs.CV

Shakeeb Murtaza, Aryan Shukla, Rajarshi Bhattacharya, Maguelonne Heritier, Eric Granger

Pith reviewed 2026-05-07 09:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords person re-identification · text-to-image retrieval · interpretable AI · part matching · vision-language models · explainable retrieval

The pith

InterPartAbility adds text-guided part matching and constrained attention to produce grounded explanations for text-to-image person re-identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InterPartAbility as a way to make text-to-image person re-identification interpretable by shifting from black-box matching to explicit part-wise alignment. It adds a lightweight patch-phrase interaction module that supplies concept-level supervision from part phrases in the text, training the model to link specific image regions to those phrases. The method further restricts the CLIP vision transformer's self-attention so that activations for each phrase become spatially concentrated, creating explanation maps that can be tested with perturbation metrics such as removing top-ranked regions and measuring retrieval drop. On standard benchmarks the approach reports stronger quantitative interpretability scores while retrieval accuracy stays competitive with prior models.

Core claim

By training with a patch-phrase interaction module that supplies open-vocabulary part-phrase guidance and by constraining CLIP ViT self-attention to produce localized patch activations, the model learns to bind visual regions directly to semantically meaningful phrases, yielding explanation maps whose quality can be measured by how much retrieval performance degrades when the highlighted regions are masked.
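
One way to make "localized patch activations" measurable is the normalized entropy of a phrase's activation map over patches. The sketch below is illustrative only: the function name `normalized_entropy` and the use of entropy as a concentration score are assumptions, and the abstract does not say how the paper constrains or scores attention.

```python
import numpy as np

def normalized_entropy(activations):
    """Entropy of a non-negative patch-activation map, scaled to [0, 1].

    0 means all mass sits on a single patch; 1 means a uniform map.
    Hypothetical helper, not taken from the paper.
    """
    p = np.asarray(activations, dtype=float).ravel()
    p = p / p.sum()                      # turn activations into a distribution
    nz = p[p > 0]                        # skip zeros so log() stays finite
    entropy = -(nz * np.log(nz)).sum()
    return float(entropy / np.log(p.size))

# Toy maps over 16 patches: one diffuse, one concentrated on a single patch.
diffuse = np.ones(16)
peaked = np.full(16, 0.01)
peaked[5] = 0.85

diffuse_score = normalized_entropy(diffuse)
peaked_score = normalized_entropy(peaked)
```

A lower score indicates a tighter map, but it says nothing about whether the map sits on the right region, which is the concern raised later in the editorial analysis.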

What carries the argument

The patch-phrase interaction module (PPIM) that supplies concept-level supervision to align image patches with part phrases extracted from the text description.
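
The Figure 2 caption describes the module as computing phrase-patch similarity and softly aggregating patch features. A minimal NumPy sketch of that general pattern follows. The cosine similarity, the softmax over patches, the `temperature` value, and all function names are assumptions made for illustration; the paper's actual PPIM may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_phrase_interaction(patches, phrases, temperature=0.07):
    """patches: (K, D) patch embeddings; phrases: (P, D) phrase embeddings.

    Returns per-phrase weights over patches (P, K), the softly pooled
    patch feature per phrase (P, D), and a phrase-region score (P,).
    """
    z = l2_normalize(patches)
    h = l2_normalize(phrases)
    sim = h @ z.T                              # (P, K) cosine similarities
    weights = softmax(sim / temperature, axis=1)
    pooled = weights @ z                       # (P, D) soft aggregation
    scores = np.sum(l2_normalize(pooled) * h, axis=1)
    return weights, pooled, scores

# Toy example: 4 orthogonal patches, 2 phrases close to patches 1 and 3.
patches = np.eye(4)
phrases = np.array([[0.1, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 0.1, 1.0]])
weights, pooled, scores = patch_phrase_interaction(patches, phrases)
best_patch = weights.argmax(axis=1)
```

Each row of `weights` can be reshaped to the patch grid and read as a phrase-conditioned heatmap; in this toy case each phrase puts most of its weight on the patch it was built from.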

If this is right

Interpretability in text-to-image re-identification can be evaluated with quantitative perturbation metrics instead of relying only on qualitative visualizations.
Explanations become open-vocabulary and tied to specific part phrases rather than limited to a closed set of concepts.
The same model can be used for both retrieval and for producing region-level evidence that supports each match.
Performance on standard retrieval benchmarks does not have to be traded off against improved explainability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same supervision and attention constraint could be tested on other vision-language retrieval tasks where part-level grounding would help users verify decisions.
If the bindings prove stable across datasets, the approach might reduce the need for separate post-hoc explanation generators in deployed re-identification systems.
Quantitative grounding scores could become an additional training objective for future vision-language models that must justify matches to operators.

Load-bearing premise

The patch-phrase interaction module together with the constrained self-attention actually creates reliable bindings between image regions and the intended part phrases rather than spurious correlations.
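
An independent check of this premise would compare each phrase's heatmap against a reference region that does not come from the model. The sketch below scores overlap between the top-k patches of a heatmap and a part mask using intersection over union. The part mask is hypothetical (the simulated rebuttal below notes that human part annotations are unavailable in CUHK-PEDES and ICFG-PEDES), and the helper names are illustrative.

```python
import numpy as np

def topk_mask(heatmap, k):
    """Boolean mask marking the k highest-valued patches of a 2-D heatmap."""
    flat = heatmap.ravel()
    idx = np.argsort(-flat, kind="stable")[:k]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[idx] = True
    return mask.reshape(heatmap.shape)

def iou(pred, ref):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return float(inter / union) if union else 0.0

# Toy 4x3 patch grid; rows 1-2 stand in for a hypothetical "torso" annotation.
part_mask = np.zeros((4, 3), dtype=bool)
part_mask[1:3, :] = True

grounded = np.array([[0.1, 0.2, 0.1],
                     [0.8, 0.9, 0.7],
                     [0.6, 0.9, 0.5],
                     [0.0, 0.1, 0.0]])
spurious = 1.0 - grounded                 # mass on head/feet rows instead

k = int(part_mask.sum())
iou_grounded = iou(topk_mask(grounded, k), part_mask)
iou_spurious = iou(topk_mask(spurious, k), part_mask)
```

Both toy heatmaps are equally concentrated, yet only the first overlaps the reference region, which is the distinction a masking-based metric alone may not capture.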

What would settle it

An experiment that masks the highest-ranked explanatory regions identified by InterPartAbility and measures whether retrieval accuracy drops more than when the same number of randomly chosen regions are masked.
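
The comparison described above can be sketched on toy data. Everything here is an assumption for illustration: images are represented as a mean of patch embeddings, relevance is the patch-query dot product, and the gallery has only two images. It is not the paper's protocol or its numbers.

```python
import numpy as np

def image_score(patches, query, keep):
    """Cosine similarity between the query and the mean of the kept patches."""
    pooled = patches[keep].mean(axis=0)
    denom = np.linalg.norm(pooled) * np.linalg.norm(query) + 1e-8
    return float(pooled @ query / denom)

def target_rank(gallery, query, target, drop=()):
    """Rank (1 = best) of the target image after masking `drop` patches in it."""
    scores = []
    for idx, patches in enumerate(gallery):
        keep = np.ones(len(patches), dtype=bool)
        if idx == target and len(drop):
            keep[list(drop)] = False
        scores.append(image_score(patches, query, keep))
    order = np.argsort(scores)[::-1]
    return int(np.where(order == target)[0][0]) + 1

query = np.array([1.0, 0.0, 0.0, 0.0])
target_img = np.eye(4)                               # patch 0 matches the query
distractor = np.tile([0.3, 1.0, 0.0, 0.0], (4, 1))   # weakly related image
gallery = [target_img, distractor]

relevance = target_img @ query                       # toy explanation map
top_patch = int(np.argmax(relevance))

baseline_rank = target_rank(gallery, query, target=0)
relevance_rank = target_rank(gallery, query, target=0, drop=[top_patch])
# Random baseline, computed exhaustively over every single-patch mask.
random_ranks = [target_rank(gallery, query, target=0, drop=[j]) for j in range(4)]
mean_random_rank = float(np.mean(random_ranks))
```

In this toy setup, masking the top-relevance patch pushes the true match from rank 1 to rank 2, while a randomly chosen patch does so only one time in four. A real test would average such gaps over many queries and masking budgets.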

Figures

Figures reproduced from arXiv: 2604.27122 by Aryan Shukla, Eric Granger, Maguelonne Heritier, Rajarshi Bhattacharya, Shakeeb Murtaza.

**Figure 1.** TI-ReID alignment paradigms. (a) Global matching: CLIP-based methods produce global image-text similarity, offering no insight into which regions. (b) Concept-level matching (PLOT, DiCo): slot attention decomposes features into concept regions but fails to bind slots to specific textual phrases, yielding unlabelled qualitative visualizations with high computational cost due to slots. (c) InterPartAbilit…

**Figure 2.** Overview of InterPartAbility. An image and caption are encoded by CLIP encoders $E_I$ and $E_T$. Global embeddings are trained with the base retrieval objective $L_{base}$. The image encoder additionally produces patch embeddings $Z_i \in \mathbb{R}^{K \times D}$. Each appearance phrase $\ell_{i,p}$ is encoded into a phrase embedding $H_i \in \mathbb{R}^{P \times D}$. The Patch-Phrase Interaction Module computes phrase-patch similarity and softly aggregates patch fe…

**Figure 3.** Sensitivity analysis of relevance-based masking. (a)

**Figure 4.** Qualitative comparison of phrase-conditioned heatmaps.

Original abstract

Text-to-image person re-identification (TI-ReID) relies on natural-language text description to retrieve top matching individuals from a large gallery of images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. A new open-vocabulary, lightweight supervision, patch-phrase interaction module (PPIM) is proposed to train a standard TI-ReID model with concept-level guidance. Concept-based part phrases provide evidence that encourages the model to attend to corresponding image regions. InterPartAbility further constrains CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. A quantitative interpretability protocol for TI-ReID is introduced by adapting perturbation-based evaluation metrics, including counterfactual region masking that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results on challenging benchmarks like CUHK-PEDES and ICFG-PEDES show that InterPartAbility achieves state-of-the-art (SOTA) interpretability performance under these metrics, while sustaining competitive retrieval accuracy. (Footnote: Our code is included in the supplementary materials and will be made public.)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

InterPartAbility adds a PPIM module plus attention constraints for part-level grounding in TI-ReID and adapts perturbation metrics for quantitative interpretability, but the binding quality lacks clear independent checks.

The authors introduce a lightweight patch-phrase interaction module (PPIM) that supplies concept-level supervision, and they constrain CLIP ViT self-attention so that the activations tied to each part phrase are spatially concentrated. They also adapt perturbation metrics, such as counterfactual region masking, to measure how much removing the top explanatory patches degrades retrieval performance. This moves past slot-attention visualizations toward something more explicit and measurable on standard benchmarks like CUHK-PEDES and ICFG-PEDES, while keeping retrieval accuracy competitive. Code is included, which helps reproducibility in this area. The work does a solid job of naming a real limitation in prior TI-ReID interpretability and offering concrete components to address it.

The main soft spot is the one the stress-test flags. The explanatory regions come from the same constrained attention that the metrics then test by masking, so higher scores could reflect tighter attention maps without confirming that those regions actually match the semantic intent of the phrases. The abstract gives no sign of separate grounding validation, such as human part annotations or independent alignment scores, to break that loop. If the full experiments include such checks, the claim strengthens; otherwise it stays a concern.

This paper is for researchers working on interpretable vision-language retrieval, especially in security or surveillance applications. A reader who wants practical modules and evaluation protocols for part-wise explanations would get direct value from it. It deserves peer review because the ideas are specific, the benchmarks are relevant, and the gaps are fixable with more validation rather than fatal. Send it to referees.

Referee Report

2 major / 1 minor

Summary. The paper proposes InterPartAbility, a method for text-to-image person re-identification (TI-ReID) that aims to improve interpretability over prior slot-attention approaches. It introduces an open-vocabulary patch-phrase interaction module (PPIM) for lightweight concept-level supervision to encourage part-wise matching between text phrases and image regions, along with constraints on CLIP ViT self-attention to produce spatially concentrated activations. A quantitative interpretability evaluation protocol is defined by adapting perturbation-based metrics, notably counterfactual region masking that measures retrieval performance drop after removing top explanatory patches. Experiments on CUHK-PEDES and ICFG-PEDES are claimed to achieve state-of-the-art interpretability scores while maintaining competitive retrieval accuracy.

Significance. If the grounding quality and metric validity hold, the work would meaningfully advance interpretable VLMs for ReID by shifting from qualitative visualizations to explicit phrase-region bindings and a standardized quantitative protocol. The commitment to releasing code supports reproducibility, which is a clear strength.

major comments (2)

[Abstract / Evaluation Protocol] Abstract and evaluation protocol section: The interpretability claims rest on perturbation metrics (counterfactual region masking) that derive top explanatory patches directly from the constrained self-attention maps produced by the same PPIM and attention mechanism under test. This creates a potential circularity where metric improvements could result from more concentrated attention patterns without independent confirmation that the activated regions semantically match the part phrases (e.g., 'red shirt' aligning with torso pixels rather than background). No cross-validation such as human part annotations or separate cross-modal alignment scores is described to break this dependency.
[Abstract] Abstract: The claim of 'state-of-the-art (SOTA) interpretability performance' and 'competitive retrieval accuracy' is stated without any numerical results, specific metric adaptations, baseline comparisons, or table references. This makes the central empirical contribution impossible to assess from the provided description and requires the full results section (including any tables on CUHK-PEDES and ICFG-PEDES) to verify whether the data actually supports the SOTA assertion.

minor comments (1)

[Abstract] The footnote on code availability is positive; ensure the supplementary materials include all training details, metric implementation code, and exact hyper-parameters for the attention constraints to enable full reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions where appropriate. Our responses aim to strengthen the presentation of the work without misrepresenting the contributions.

Referee: [Abstract / Evaluation Protocol] Abstract and evaluation protocol section: The interpretability claims rest on perturbation metrics (counterfactual region masking) that derive top explanatory patches directly from the constrained self-attention maps produced by the same PPIM and attention mechanism under test. This creates a potential circularity where metric improvements could result from more concentrated attention patterns without independent confirmation that the activated regions semantically match the part phrases (e.g., 'red shirt' aligning with torso pixels rather than background). No cross-validation such as human part annotations or separate cross-modal alignment scores is described to break this dependency.

Authors: We appreciate the referee's identification of this potential circularity in the evaluation design. The counterfactual region masking metric follows established perturbation-based protocols in interpretability research, where the goal is to assess the causal contribution of the identified regions to the downstream retrieval task rather than solely relying on attention concentration. The PPIM provides explicit phrase-to-patch supervision during training to encourage semantic alignment, and the metric then measures whether these regions are functionally important. That said, we acknowledge that this does not constitute fully independent semantic verification (e.g., via human part annotations, which are unavailable in standard TI-ReID benchmarks like CUHK-PEDES and ICFG-PEDES). In the revised manuscript, we will add a dedicated discussion of this limitation in the evaluation protocol section, include supplementary cross-modal alignment scores (phrase embedding similarity to masked regions), and clarify how the training constraints mitigate but do not fully eliminate the dependency. This constitutes a partial revision. revision: partial
Referee: [Abstract] Abstract: The claim of 'state-of-the-art (SOTA) interpretability performance' and 'competitive retrieval accuracy' is stated without any numerical results, specific metric adaptations, baseline comparisons, or table references. This makes the central empirical contribution impossible to assess from the provided description and requires the full results section (including any tables on CUHK-PEDES and ICFG-PEDES) to verify whether the data actually supports the SOTA assertion.

Authors: The abstract follows standard conventions by summarizing the method, contributions, and high-level outcomes without embedding specific numerical values or table references, which would exceed typical length and readability constraints. The full empirical support—including adapted perturbation metrics, exact SOTA interpretability scores, retrieval accuracies, baseline comparisons, and results on CUHK-PEDES and ICFG-PEDES—is provided in the Experiments section with accompanying tables. These demonstrate the claimed improvements in interpretability while maintaining competitive accuracy. We do not believe changes to the abstract are required, as the detailed evidence is already present in the main body for reviewers to assess. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation relies on novel modules and adapted metrics.

The paper introduces PPIM for concept-level guidance and constrains CLIP ViT self-attention to produce grounded maps, then evaluates via adapted perturbation metrics (counterfactual masking). No equations or steps reduce predictions to inputs by construction, no self-citation chains load-bear the central claims, and no renaming of known results occurs. The method is self-contained against external benchmarks like CUHK-PEDES, with interpretability claims resting on new supervision rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are identified; the approach builds on existing CLIP ViT models with new supervision and constraints.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

[1] Alehdaghi, M., Bhattacharya, R., Shamsolmoali, P., Cruz, R.M., Heritier, M., Granger, E.: Beyond patches: Mining interpretable part-prototypes for explainable AI. arXiv preprint arXiv:2504.12197 (2025)

[2] Bai, Y., Ji, Y., Cao, M., Wang, J., Ye, M.: Chat-based person retrieval via dialogue-refined cross-modal alignment. In: CVPR (2025)

[3] Cao, M., Bai, Y., Zeng, Z., Ye, M., Zhang, M.: An empirical study of CLIP for text-based person search. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

[4] Chefer, H., Gur, S., Wolf, L.: Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 397–406 (2021)

[5] Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. Advances in neural information processing systems 32 (2019)

[6] Cohen, D., Chefer, H., Wolf, L.: A meaningful perturbation metric for evaluating explainability methods. In: Scandinavian Conference on Image Analysis. pp. 309–

[7] Ding, Z., Ding, C., Shao, Z., Tao, D.: Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv preprint arXiv:2107.12666 (2021)

[8] Ergasti, A., Fontanini, T., Ferrari, C., Bertozzi, M., Prati, A.: Mars: Paying more attention to visual attributes for text-based person search. ACM Transactions on Multimedia Computing, Communications and Applications 21(10), 1–22 (2025)

[9] Gao, C., Cai, G., Jiang, X., Zheng, F., Zhang, J., Gong, Y., Peng, P., Guo, X., Sun, X.: Contextual non-local alignment over full-scale representation for text-based person search. arXiv preprint arXiv:2101.03036 (2021)

[10] Heritier, M., Mekhazni, D., Leblond-Menard, C., Godbout, B., Guilbaud, N., Alehdaghi, M., Granger, E.: Exam: Unsupervised concept-based representation learning to better explain models in vision tasks. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2750–2759 (2025)

[11] Jiang, D., Ye, M.: Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2787–2797 (2023)

[12] Jing, Y., Si, C., Wang, J., Wang, W., Wang, L., Tan, T.: Pose-guided multi-granularity attention network for text-based person search. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)

[13] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al.: Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In: International conference on machine learning. pp. 2668–2677. PMLR (2018)

[14] Kim, G., Eom, C.: DiCo: Disentangled concept representation for text-to-image person re-identification. Neurocomputing p. 132885 (2026)

[15] Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., Wang, X.: Person search with natural language description. In: CVPR (2017)

[16] Liao, S., Shao, L.: Interpretable and generalizable person re-identification with query-adaptive convolution and temporal lifting. In: European conference on computer vision. pp. 456–474. Springer (2020)

[17] Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., Kipf, T.: Object-centric learning with slot attention. Advances in neural information processing systems 33, 11525–11538 (2020)

[18] Nauta, M., Schlötterer, J., Van Keulen, M., Seifert, C.: Pip-net: Patch-based intuitive prototypes for interpretable image classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2744–2753 (2023)

[19] Park, J., Kim, D., Jeong, B., Kwak, S.: PLOT: Text-based person search with part slot attention for corresponding part discovery. In: European Conference on Computer Vision. pp. 474–490. Springer (2024)

[20] Qin, Y., Chen, C., Fu, Z., Peng, D., Peng, X., Hu, P.: Human-centered interactive learning via MLLMs for text-to-image person re-identification. In: CVPR (2025)

[21] Qin, Y., Chen, Y., Peng, D., Peng, X., Zhou, J.T., Hu, P.: Noisy-correspondence learning for text-to-image person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27197–27206 (2024)

[22] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[23] Tan, W., Ding, C., Jiang, J., Wang, F., Zhan, Y., Tao, D.: Harnessing the power of MLLMs for transferable text-to-image person ReID. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17127–17137 (2024)

[24] Wang, Z., Fang, Z., Wang, J., Yang, Y.: Vitaa: Visual-textual attributes alignment in person search by natural language. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16. pp. 402–420. Springer (2020)

[25] Yan, S., Dong, N., Zhang, L., Tang, J.: CLIP-driven fine-grained text-image person re-identification. IEEE Transactions on Image Processing (2023)

[26] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

[27] Yang, S., Zhou, Y., Zheng, Z., Wang, Y., Zhu, L., Wu, Y.: Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark. In: Proceedings of the 31st ACM international conference on multimedia. pp. 4492–4501 (2023)

[28] Zhang, Y., Lu, H.: Deep cross-modal projection learning for image-text matching. In: Proceedings of the European conference on computer vision (ECCV). pp. 686–701 (2018)

[29] Zhao, Z., Liu, B., Lu, Y., Chu, Q., Yu, N.: Unifying multi-modal uncertainty modeling and semantic alignment for text-to-image person re-identification. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

[30] Zhu, A., Wang, Z., Li, Y., Wan, X., Jin, J., Wang, T., Hu, F., Hua, G.: Dssl: Deep surroundings-person separation learning for text-based person retrieval. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 209–217. MM ’21 (2021)

[31] Zuo, J., Zhou, H., Nie, Y., Zhang, F., Guo, T., Sang, N., Wang, Y., Gao, C.: Ufinebench: Towards text-based person retrieval with ultra-fine granularity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22010–22019 (2024)
