InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification
Pith reviewed 2026-05-07 09:25 UTC · model grok-4.3
The pith
InterPartAbility adds text-guided part matching and constrained attention to produce grounded explanations for text-to-image person re-identification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InterPartAbility trains with a patch-phrase interaction module that supplies open-vocabulary part-phrase guidance, and it constrains CLIP ViT self-attention to produce localized patch activations. With both in place, the model learns to bind visual regions directly to semantically meaningful phrases. The resulting explanation maps can be scored by how much retrieval performance degrades when the highlighted regions are masked.
What carries the argument
The patch-phrase interaction module (PPIM) that supplies concept-level supervision to align image patches with part phrases extracted from the text description.
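To make "aligning image patches with part phrases" concrete, here is a minimal Python sketch. It is not the paper's PPIM, whose architecture and loss are not described on this page. It only assumes that every image patch and every part phrase has an embedding in a shared space, and it turns patch-phrase cosine similarities into a normalized map over patches. The function name `phrase_patch_map`, the `temperature` parameter, and the toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def phrase_patch_map(phrase, patches, temperature=0.1):
    """Softmax over patch-phrase cosine similarities: one weight per patch.

    `temperature` is a hypothetical knob; smaller values give sharper maps.
    """
    logits = [cosine(phrase, p) / temperature for p in patches]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (assumed): a phrase embedding and three patch embeddings.
phrase = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(phrase_patch_map(phrase, patches))
```

A map like this is what a supervision signal could push toward the patches that actually depict the named part; whether the learned map does so is exactly the load-bearing premise discussed further down the page.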
If this is right
- Interpretability in text-to-image re-identification can be evaluated with quantitative perturbation metrics instead of relying only on qualitative visualizations.
- Explanations become open-vocabulary and tied to specific part phrases rather than limited to a closed set of concepts.
- The same model can be used for both retrieval and for producing region-level evidence that supports each match.
- Performance on standard retrieval benchmarks does not have to be traded off against improved explainability.
Where Pith is reading between the lines
- The same supervision and attention constraint could be tested on other vision-language retrieval tasks where part-level grounding would help users verify decisions.
- If the bindings prove stable across datasets, the approach might reduce the need for separate post-hoc explanation generators in deployed re-identification systems.
- Quantitative grounding scores could become an additional training objective for future vision-language models that must justify matches to operators.
Load-bearing premise
The patch-phrase interaction module together with the constrained self-attention actually creates reliable bindings between image regions and the intended part phrases rather than spurious correlations.
What would settle it
An experiment that masks the highest-ranked explanatory regions identified by InterPartAbility and measures whether retrieval accuracy drops more than when the same number of randomly chosen regions are masked.
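The comparison described above can be sketched in a few lines of Python. This is a toy stand-in under stated assumptions, not the paper's protocol: an image is a list of patch embeddings, the image embedding is the mean of the unmasked patches, and the drop in query-image cosine similarity is used as a proxy for the drop in retrieval accuracy. The names `masking_gap` and `score_after_masking`, the `trials` and `seed` parameters, and the toy data are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_after_masking(query, patches, masked):
    """Similarity between the query and the mean of the patches not in `masked`."""
    keep = [p for i, p in enumerate(patches) if i not in masked]
    dim = len(patches[0])
    if not keep:
        return 0.0
    pooled = [sum(p[d] for p in keep) / len(keep) for d in range(dim)]
    return cosine(query, pooled)


def masking_gap(query, patches, relevance, k, trials=100, seed=0):
    """Return (drop_top, drop_random).

    drop_top: similarity drop after masking the k patches with highest relevance.
    drop_random: mean similarity drop over `trials` random k-patch masks.
    """
    indices = range(len(patches))
    base = score_after_masking(query, patches, set())
    top = set(sorted(indices, key=lambda i: relevance[i], reverse=True)[:k])
    drop_top = base - score_after_masking(query, patches, top)
    rng = random.Random(seed)
    drops = [
        base - score_after_masking(query, patches, set(rng.sample(indices, k)))
        for _ in range(trials)
    ]
    return drop_top, sum(drops) / len(drops)


# Toy data (assumed): two patches match the query, six are background.
query = [1.0, 0.0]
patches = [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 6
relevance = [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
print(masking_gap(query, patches, relevance, k=2))
```

If the explanation map is informative, `drop_top` should clearly exceed `drop_random`. A full evaluation would replace the similarity proxy with a ranking metric computed over a gallery.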
Text-to-image person re-identification (TI-ReID) relies on a natural-language text description to retrieve top matching individuals from a large gallery of images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. A new open-vocabulary, lightweight supervision, patch-phrase interaction module (PPIM) is proposed to train a standard TI-ReID model with concept-level guidance. Concept-based part phrases provide evidence that encourages the model to attend to corresponding image regions. InterPartAbility further constrains CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. A quantitative interpretability protocol for TI-ReID is introduced by adapting perturbation-based evaluation metrics, including counterfactual region masking that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results on challenging benchmarks like CUHK-PEDES and ICFG-PEDES show that InterPartAbility achieves state-of-the-art (SOTA) interpretability performance under these metrics, while sustaining competitive retrieval accuracy. (Footnote: Our code is included in the supplementary materials and will be made public.)
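The abstract does not say how "spatially concentrated" is enforced or measured. One common way to quantify concentration, shown here purely as an assumption, is the normalized entropy of a patch-weight map: values near 0 mean the mass sits on a few patches, and values near 1 mean it is spread evenly. The function name `normalized_entropy` and the toy maps are hypothetical.

```python
import math


def normalized_entropy(weights):
    """Shannon entropy of a non-negative weight map, scaled to [0, 1].

    The weights are renormalized to sum to 1 before the entropy is taken.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probs = [w / total for w in weights]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# Toy maps (assumed): one concentrated on a single patch, one diffuse.
concentrated = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]
print(normalized_entropy(concentrated), normalized_entropy(diffuse))
```

A low score only shows that a map is sharp, not that it is sharp in the right place, which is the distinction the referee report below presses on.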
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes InterPartAbility, a method for text-to-image person re-identification (TI-ReID) that aims to improve interpretability over prior slot-attention approaches. It introduces an open-vocabulary patch-phrase interaction module (PPIM) for lightweight concept-level supervision to encourage part-wise matching between text phrases and image regions, along with constraints on CLIP ViT self-attention to produce spatially concentrated activations. A quantitative interpretability evaluation protocol is defined by adapting perturbation-based metrics, notably counterfactual region masking that measures retrieval performance drop after removing top explanatory patches. Experiments on CUHK-PEDES and ICFG-PEDES are claimed to achieve state-of-the-art interpretability scores while maintaining competitive retrieval accuracy.
Significance. If the grounding quality and metric validity hold, the work would meaningfully advance interpretable VLMs for ReID by shifting from qualitative visualizations to explicit phrase-region bindings and a standardized quantitative protocol. The commitment to releasing code supports reproducibility, which is a clear strength.
major comments (2)
- [Abstract / Evaluation Protocol] Abstract and evaluation protocol section: The interpretability claims rest on perturbation metrics (counterfactual region masking) that derive top explanatory patches directly from the constrained self-attention maps produced by the same PPIM and attention mechanism under test. This creates a potential circularity where metric improvements could result from more concentrated attention patterns without independent confirmation that the activated regions semantically match the part phrases (e.g., 'red shirt' aligning with torso pixels rather than background). No cross-validation such as human part annotations or separate cross-modal alignment scores is described to break this dependency.
- [Abstract] Abstract: The claim of 'state-of-the-art (SOTA) interpretability performance' and 'competitive retrieval accuracy' is stated without any numerical results, specific metric adaptations, baseline comparisons, or table references. This makes the central empirical contribution impossible to assess from the provided description and requires the full results section (including any tables on CUHK-PEDES and ICFG-PEDES) to verify whether the data actually supports the SOTA assertion.
minor comments (1)
- [Abstract] The footnote on code availability is positive; ensure the supplementary materials include all training details, metric implementation code, and exact hyper-parameters for the attention constraints to enable full reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions where appropriate. Our responses aim to strengthen the presentation of the work without misrepresenting the contributions.
- Referee: [Abstract / Evaluation Protocol] Abstract and evaluation protocol section: The interpretability claims rest on perturbation metrics (counterfactual region masking) that derive top explanatory patches directly from the constrained self-attention maps produced by the same PPIM and attention mechanism under test. This creates a potential circularity where metric improvements could result from more concentrated attention patterns without independent confirmation that the activated regions semantically match the part phrases (e.g., 'red shirt' aligning with torso pixels rather than background). No cross-validation such as human part annotations or separate cross-modal alignment scores is described to break this dependency.
  Authors: We appreciate the referee's identification of this potential circularity in the evaluation design. The counterfactual region masking metric follows established perturbation-based protocols in interpretability research, where the goal is to assess the causal contribution of the identified regions to the downstream retrieval task rather than solely relying on attention concentration. The PPIM provides explicit phrase-to-patch supervision during training to encourage semantic alignment, and the metric then measures whether these regions are functionally important. That said, we acknowledge that this does not constitute fully independent semantic verification (e.g., via human part annotations, which are unavailable in standard TI-ReID benchmarks like CUHK-PEDES and ICFG-PEDES). In the revised manuscript, we will add a dedicated discussion of this limitation in the evaluation protocol section, include supplementary cross-modal alignment scores (phrase embedding similarity to masked regions), and clarify how the training constraints mitigate but do not fully eliminate the dependency.
  revision: partial
- Referee: [Abstract] Abstract: The claim of 'state-of-the-art (SOTA) interpretability performance' and 'competitive retrieval accuracy' is stated without any numerical results, specific metric adaptations, baseline comparisons, or table references. This makes the central empirical contribution impossible to assess from the provided description and requires the full results section (including any tables on CUHK-PEDES and ICFG-PEDES) to verify whether the data actually supports the SOTA assertion.
  Authors: The abstract follows standard conventions by summarizing the method, contributions, and high-level outcomes without embedding specific numerical values or table references, which would exceed typical length and readability constraints. The full empirical support (adapted perturbation metrics, exact SOTA interpretability scores, retrieval accuracies, baseline comparisons, and results on CUHK-PEDES and ICFG-PEDES) is provided in the Experiments section with accompanying tables. These demonstrate the claimed improvements in interpretability while maintaining competitive accuracy. We do not believe changes to the abstract are required, as the detailed evidence is already present in the main body for reviewers to assess.
  revision: no
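The first response mentions supplementary cross-modal alignment scores based on phrase embedding similarity to the masked regions. The rebuttal gives no formula, so the following Python sketch is only one plausible reading: compare how similar a phrase embedding is to the pooled patches inside the explanatory region with how similar it is to the pooled patches outside it. The name `alignment_margin`, the mean pooling, and the toy data are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def alignment_margin(phrase, patches, region):
    """cos(phrase, pooled region patches) minus cos(phrase, pooled remaining patches).

    `region` is a set of patch indices; it must be a non-empty proper subset.
    """
    inside = [p for i, p in enumerate(patches) if i in region]
    outside = [p for i, p in enumerate(patches) if i not in region]
    if not inside or not outside:
        raise ValueError("region must be a non-empty proper subset of the patches")
    return cosine(phrase, mean_pool(inside)) - cosine(phrase, mean_pool(outside))


# Toy data (assumed): the first two patches match the phrase, the rest do not.
phrase = [1.0, 0.0]
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(alignment_margin(phrase, patches, {0, 1}))
```

A score of this kind is only independent of the method under test if the phrase and patch embeddings come from a model other than the one being evaluated.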
Circularity Check
No significant circularity; derivation relies on novel modules and adapted metrics.
The paper introduces PPIM for concept-level guidance and constrains CLIP ViT self-attention to produce grounded maps, then evaluates via adapted perturbation metrics (counterfactual masking). No equations or steps reduce predictions to inputs by construction, no self-citation chains load-bear the central claims, and no renaming of known results occurs. The method is self-contained against external benchmarks like CUHK-PEDES, with interpretability claims resting on new supervision rather than tautological redefinitions.