
arxiv: 2604.27122 · v1 · submitted 2026-04-29 · 💻 cs.CV

InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification

Pith reviewed 2026-05-07 09:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords person re-identification · text-to-image retrieval · interpretable AI · part matching · vision-language models · explainable retrieval

The pith

InterPartAbility adds text-guided part matching and constrained attention to produce grounded explanations for text-to-image person re-identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InterPartAbility as a way to make text-to-image person re-identification interpretable by shifting from black-box matching to explicit part-wise alignment. It adds a lightweight patch-phrase interaction module that supplies concept-level supervision from part phrases in the text, training the model to link specific image regions to those phrases. The method further restricts the CLIP vision transformer's self-attention so that activations for each phrase become spatially concentrated, creating explanation maps that can be tested with perturbation metrics such as removing top-ranked regions and measuring retrieval drop. On standard benchmarks the approach reports stronger quantitative interpretability scores while retrieval accuracy stays competitive with prior models.

Core claim

By training with a patch-phrase interaction module that supplies open-vocabulary part-phrase guidance and by constraining CLIP ViT self-attention to produce localized patch activations, the model learns to bind visual regions directly to semantically meaningful phrases, yielding explanation maps whose quality can be measured by how much retrieval performance degrades when the highlighted regions are masked.
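
One way to make "localized patch activations" concrete is to score how concentrated a phrase's activation map is. The sketch below uses normalized Shannon entropy over non-negative patch activations: values near 0 mean the mass sits on a few patches, values near 1 mean it is spread uniformly. This is an illustrative measure only; the summary does not state how the paper constrains or quantifies concentration, and the function name `normalized_entropy` is hypothetical.

```python
import math


def normalized_entropy(activations):
    """Return the entropy of a patch activation map, scaled to [0, 1].

    Assumption: `activations` holds non-negative scores, one per patch.
    0.0 means all mass on one patch; 1.0 means a uniform map.
    """
    total = sum(activations)
    if total <= 0:
        raise ValueError("activations must contain positive mass")
    probs = [a / total for a in activations]
    # Zero-probability patches contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    if len(probs) < 2:
        return 0.0
    # Divide by the maximum possible entropy, log(number of patches).
    return entropy / math.log(len(probs))
```

Under this measure, a model whose phrase maps score lower than a baseline's would be producing more spatially concentrated activations, although concentration alone says nothing about whether the right region is highlighted.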

What carries the argument

The patch-phrase interaction module (PPIM) that supplies concept-level supervision to align image patches with part phrases extracted from the text description.
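
The Figure 2 caption describes the module as computing phrase-patch similarity and softly aggregating patch features, with K patch embeddings and P phrase embeddings of dimension D. A minimal sketch of that kind of computation follows. The cosine similarity, the softmax over patches, the temperature value, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_phrase_interaction(patches, phrases, temperature=0.1):
    """Hypothetical phrase-patch interaction.

    patches: K patch embeddings, each a list of D floats.
    phrases: P phrase embeddings, each a list of D floats.
    Returns (weights, parts): weights is P x K (one distribution over
    patches per phrase), parts is P x D (phrase-conditioned patch features).
    """
    unit_patches = [l2_normalize(p) for p in patches]
    weights, parts = [], []
    for phrase in phrases:
        q = l2_normalize(phrase)
        # Cosine similarity between this phrase and every patch.
        sims = [sum(a * b for a, b in zip(q, p)) for p in unit_patches]
        w = softmax(sims, temperature)
        # Soft aggregation: similarity-weighted average of the raw patches.
        dim = len(patches[0])
        part = [sum(w[k] * patches[k][d] for k in range(len(patches)))
                for d in range(dim)]
        weights.append(w)
        parts.append(part)
    return weights, parts
```

Each row of `weights` can be reshaped to the patch grid and read as a phrase-conditioned heatmap, which is the form of explanation the summary attributes to the method.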

If this is right

  • Interpretability in text-to-image re-identification can be evaluated with quantitative perturbation metrics instead of relying only on qualitative visualizations.
  • Explanations become open-vocabulary and tied to specific part phrases rather than limited to a closed set of concepts.
  • The same model can be used for both retrieval and for producing region-level evidence that supports each match.
  • Performance on standard retrieval benchmarks does not have to be traded off against improved explainability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same supervision and attention constraint could be tested on other vision-language retrieval tasks where part-level grounding would help users verify decisions.
  • If the bindings prove stable across datasets, the approach might reduce the need for separate post-hoc explanation generators in deployed re-identification systems.
  • Quantitative grounding scores could become an additional training objective for future vision-language models that must justify matches to operators.

Load-bearing premise

The patch-phrase interaction module together with the constrained self-attention actually creates reliable bindings between image regions and the intended part phrases rather than spurious correlations.

What would settle it

An experiment that masks the highest-ranked explanatory regions identified by InterPartAbility and measures whether retrieval accuracy drops more than when the same number of randomly chosen regions are masked.
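
The comparison can be sketched on a single image-text pair. The code below masks the k highest-relevance patches, measures the drop in a matching score, and compares it with the average drop from masking k random patches. It is a toy stand-in: the paper measures retrieval degradation over a gallery, whereas the mean-pooled dot-product score here, and every function name, are assumptions made for illustration.

```python
import random


def match_score(patches, text, keep):
    """Toy score: dot product of the text vector with mean-pooled kept patches."""
    kept = [patches[i] for i in keep]
    if not kept:
        return 0.0
    dim = len(text)
    pooled = [sum(p[d] for p in kept) / len(kept) for d in range(dim)]
    return sum(a * b for a, b in zip(pooled, text))


def masking_drop(patches, text, masked):
    """Score with all patches minus score after removing the masked patches."""
    everything = list(range(len(patches)))
    keep = [i for i in everything if i not in masked]
    return match_score(patches, text, everything) - match_score(patches, text, keep)


def top_k(relevance, k):
    """Indices of the k patches with the highest relevance."""
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return set(order[:k])


def compare_to_random(patches, text, relevance, k, trials=200, seed=0):
    """Return (drop from masking top-k, mean drop from masking k random patches)."""
    rng = random.Random(seed)
    top_drop = masking_drop(patches, text, top_k(relevance, k))
    random_drops = [
        masking_drop(patches, text, set(rng.sample(range(len(patches)), k)))
        for _ in range(trials)
    ]
    return top_drop, sum(random_drops) / trials
```

If the explanation maps are informative, the first value should clearly exceed the second. A gap near zero would suggest the highlighted regions matter no more than arbitrary ones.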

Figures

Figures reproduced from arXiv: 2604.27122 by Aryan Shukla, Eric Granger, Maguelonne Heritier, Rajarshi Bhattacharya, Shakeeb Murtaza.

Figure 1: TI-ReID alignment paradigms. (a) Global matching: CLIP-based methods produce global image-text similarity, offering no insight into which regions. (b) Concept-level matching (PLOT, DiCo): slot attention decomposes features into concept regions but fails to bind slots to specific textual phrases, yielding unlabelled qualitative visualizations with high computational cost due to slots. (c) InterPartAbilit…
Figure 2: Overview of InterPartAbility. An image and caption are encoded by CLIP encoders $E_I$ and $E_T$. Global embeddings are trained with the base retrieval objective $\mathcal{L}_{\text{base}}$. The image encoder additionally produces patch embeddings $Z_i \in \mathbb{R}^{K \times D}$. Each appearance phrase $\ell_{i,p}$ is encoded into a phrase embedding $H_i \in \mathbb{R}^{P \times D}$. The Patch-Phrase Interaction Module computes phrase-patch similarity and softly aggregates patch fe…
Figure 3: Sensitivity analysis of relevance-based masking. (a) …
Figure 4: Qualitative comparison of phrase-conditioned heatmaps.
Original abstract

Text-to-image person re-identification (TI-ReID) relies on natural-language text description to retrieve top matching individuals from a large gallery of images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. A new open-vocabulary, lightweight supervision, patch-phrase interaction module (PPIM) is proposed to train a standard TI-ReID model with concept-level guidance. Concept-based part phrases provide evidence that encourages the model to attend to corresponding image regions. InterPartAbility further constrains CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. A quantitative interpretability protocol for TI-ReID is introduced by adapting perturbation-based evaluation metrics, including counterfactual region masking that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results on challenging benchmarks like CUHK-PEDES and ICFG-PEDES show that InterPartAbility achieves state-of-the-art (SOTA) interpretability performance under these metrics, while sustaining competitive retrieval accuracy. (Footnote: Our code is included in the supplementary materials and will be made public.)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes InterPartAbility, a method for text-to-image person re-identification (TI-ReID) that aims to improve interpretability over prior slot-attention approaches. It introduces an open-vocabulary patch-phrase interaction module (PPIM) for lightweight concept-level supervision to encourage part-wise matching between text phrases and image regions, along with constraints on CLIP ViT self-attention to produce spatially concentrated activations. A quantitative interpretability evaluation protocol is defined by adapting perturbation-based metrics, notably counterfactual region masking that measures retrieval performance drop after removing top explanatory patches. Experiments on CUHK-PEDES and ICFG-PEDES are claimed to achieve state-of-the-art interpretability scores while maintaining competitive retrieval accuracy.

Significance. If the grounding quality and metric validity hold, the work would meaningfully advance interpretable VLMs for ReID by shifting from qualitative visualizations to explicit phrase-region bindings and a standardized quantitative protocol. The commitment to releasing code supports reproducibility, which is a clear strength.

major comments (2)
  1. [Abstract / Evaluation Protocol] Abstract and evaluation protocol section: The interpretability claims rest on perturbation metrics (counterfactual region masking) that derive top explanatory patches directly from the constrained self-attention maps produced by the same PPIM and attention mechanism under test. This creates a potential circularity where metric improvements could result from more concentrated attention patterns without independent confirmation that the activated regions semantically match the part phrases (e.g., 'red shirt' aligning with torso pixels rather than background). No cross-validation such as human part annotations or separate cross-modal alignment scores is described to break this dependency.
  2. [Abstract] Abstract: The claim of 'state-of-the-art (SOTA) interpretability performance' and 'competitive retrieval accuracy' is stated without any numerical results, specific metric adaptations, baseline comparisons, or table references. This makes the central empirical contribution impossible to assess from the provided description and requires the full results section (including any tables on CUHK-PEDES and ICFG-PEDES) to verify whether the data actually supports the SOTA assertion.
minor comments (1)
  1. [Abstract] The footnote on code availability is positive; ensure the supplementary materials include all training details, metric implementation code, and exact hyper-parameters for the attention constraints to enable full reproduction.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions where appropriate. Our responses aim to strengthen the presentation of the work without misrepresenting the contributions.

Point-by-point responses
  1. Referee: [Abstract / Evaluation Protocol] Abstract and evaluation protocol section: The interpretability claims rest on perturbation metrics (counterfactual region masking) that derive top explanatory patches directly from the constrained self-attention maps produced by the same PPIM and attention mechanism under test. This creates a potential circularity where metric improvements could result from more concentrated attention patterns without independent confirmation that the activated regions semantically match the part phrases (e.g., 'red shirt' aligning with torso pixels rather than background). No cross-validation such as human part annotations or separate cross-modal alignment scores is described to break this dependency.

    Authors: We appreciate the referee's identification of this potential circularity in the evaluation design. The counterfactual region masking metric follows established perturbation-based protocols in interpretability research, where the goal is to assess the causal contribution of the identified regions to the downstream retrieval task rather than solely relying on attention concentration. The PPIM provides explicit phrase-to-patch supervision during training to encourage semantic alignment, and the metric then measures whether these regions are functionally important. That said, we acknowledge that this does not constitute fully independent semantic verification (e.g., via human part annotations, which are unavailable in standard TI-ReID benchmarks like CUHK-PEDES and ICFG-PEDES). In the revised manuscript, we will add a dedicated discussion of this limitation in the evaluation protocol section, include supplementary cross-modal alignment scores (phrase embedding similarity to masked regions), and clarify how the training constraints mitigate but do not fully eliminate the dependency. This constitutes a partial revision. revision: partial

  2. Referee: [Abstract] Abstract: The claim of 'state-of-the-art (SOTA) interpretability performance' and 'competitive retrieval accuracy' is stated without any numerical results, specific metric adaptations, baseline comparisons, or table references. This makes the central empirical contribution impossible to assess from the provided description and requires the full results section (including any tables on CUHK-PEDES and ICFG-PEDES) to verify whether the data actually supports the SOTA assertion.

    Authors: The abstract follows standard conventions by summarizing the method, contributions, and high-level outcomes without embedding specific numerical values or table references, which would exceed typical length and readability constraints. The full empirical support—including adapted perturbation metrics, exact SOTA interpretability scores, retrieval accuracies, baseline comparisons, and results on CUHK-PEDES and ICFG-PEDES—is provided in the Experiments section with accompanying tables. These demonstrate the claimed improvements in interpretability while maintaining competitive accuracy. We do not believe changes to the abstract are required, as the detailed evidence is already present in the main body for reviewers to assess. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation relies on novel modules and adapted metrics.

full rationale

The paper introduces PPIM for concept-level guidance and constrains CLIP ViT self-attention to produce grounded maps, then evaluates via adapted perturbation metrics (counterfactual masking). No equations or steps reduce predictions to inputs by construction, the central claims do not depend on self-citation chains, and no known results are renamed. The method is self-contained against external benchmarks like CUHK-PEDES, with interpretability claims resting on new supervision rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are identified; the approach builds on existing CLIP ViT models with new supervision and constraints.

