pith. sign in

arxiv: 2605.19527 · v1 · pith:SSYWHTR3new · submitted 2026-05-19 · 💻 cs.CV

Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification

Pith reviewed 2026-05-20 06:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords Occluded Person Re-IdentificationCLIPDual Prompt LearningFeature FusionData AugmentationComputer Vision
0
0 comments X

The pith

Dual prompts in CLIP capture full pedestrian semantics despite occlusions for better re-identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that incorporating dual prompt learning allows CLIP models to use textual information to understand complete pedestrian appearances even when parts are occluded. By adding a real-world occlusion augmentation technique, the training data better reflects actual blocking situations, and a gated fusion method helps extract more complete features. A sympathetic reader would care because person re-identification in real settings often deals with crowded scenes where visibility is limited, and improving accuracy here could lead to more reliable tracking systems. The experiments demonstrate superior results on standard occluded datasets.

Core claim

The authors claim that their DPL-ReID model, featuring a Dual Prompt Learning strategy for capturing complete pedestrian semantics via textual cues and robustness to occlusion, a Real-World Occlusion Augmentation method to simulate realistic occlusions, and a Weighted Gated Feature Fusion using LSNet for guiding the CLIP visual encoder, achieves state-of-the-art performance on occluded person re-identification benchmarks.

What carries the argument

The Dual Prompt Learning (Dual-PL) strategy combined with Real-World Occlusion Augmentation (RWOA) and Weighted Gated Feature Fusion (WGFF) mechanism.

If this is right

  • Improved robustness in matching partially visible pedestrians across camera views.
  • More comprehensive feature representations generated by the CLIP visual encoder.
  • Enhanced training through enrichment of occluded samples with realistic simulations.
  • State-of-the-art results on benchmark occluded ReID datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dual-prompt techniques could be explored for other partial information tasks in computer vision, such as recognizing objects under heavy clutter.
  • Extending the occlusion library with more diverse real-world examples might increase the model's applicability to varied environments like urban streets or indoor spaces.
  • The approach opens possibilities for integrating additional modalities, like depth information, to further mitigate occlusion effects.

Load-bearing premise

That the proposed Real-World Occlusion Augmentation realistically simulates the occlusion scenarios found in the real world.

What would settle it

If experiments on occluded ReID datasets using occlusions not covered by the provided occlusion instance library show that DPL-ReID does not outperform existing methods, this would challenge the effectiveness of the augmentation and overall approach.

Figures

Figures reproduced from arXiv: 2605.19527 by Kai Feng, Shaotong Qiao, Wei Wei, Zhangjian Ji.

Figure 1
Figure 1. Figure 1: Current mainstream approaches for handling occluded person ReID: (a) Part-based matching methods, which ignore occlusion information. (b) Methods that introduce human pose cues to predict occluded regions. (c) Methods that add occluders to simulate real-world occlusion. Alignment A photo of a X X X X person. Text encoder Visual encoder Alignment A photo of a X X X X person. Text encoder Visual encoder A ph… view at source ↗
Figure 2
Figure 2. Figure 2: Two different person re-identification frameworks. (a) The CLIP-ReID framework, which aligns image and text features. (b) Our proposed DP-ReID, equipped with both holistic pedestrian prompts and occluded pedestrian prompts. observe that some existing ReID models rely excessively on a single body region for identification. To address this issue, we design a Weighted Gated Feature Fusion (WGFF) method and in… view at source ↗
Figure 3
Figure 3. Figure 3: Examples of real occluded pedestrians and different augmentation strategies. (a) Real-world occlusion scenarios. (b) Simulate occlusion by randomp erasing. (c) OAMN augmentation. (d) OIA augmentation from the FCFormer. (e) Our proposed Real-World Occlusion Augmentation (RWOA). • Extensive experiments on benchmark occluded ReID datasets demonstrate that our approach is superior to state￾of-the-art (SOTA) me… view at source ↗
Figure 4
Figure 4. Figure 4: Some samples from the occlusion instance library. Pasting-based methods, such as OAMN (Chen et al., 2021), superimpose occluders from a background library onto pedestrian images, but these rarely match real-world patterns. SUREID (Song and Liu, 2024) employs realistic categories of occluders, such as bicycles, cars and umbrellas, yet represents them as non-instance rectangular patches, which results in vis… view at source ↗
Figure 5
Figure 5. Figure 5: The overall architecture of the dual-prompt learning CLIP hybrid visual encoder (DPL-ReID). DPL-ReID adopts a two-stage training strategy that extends CLIP-ReID with a Dual Prompt Learning module (Dual-PL) and a Weighted Gated Feature Fusion method (WGFF). In the first stage, two sets of image–text pairs are constructed. These two pairs correspond to the full-body semantics and occlusion semantics of pedes… view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of the effects of Distribution Alignment on occluded images. The first row adjusts the brightness of the occluder to match the surrounding environment. The second row reduces the contrast of the occluder relative to the background. WGFF Stem Projection Downsample Downsample Downsample LS Block LS Block LS Block MSA Block Block1 Block2 Block3 Block4 Feature fusion Shape Adjustment Layer1~2 Att… view at source ↗
Figure 7
Figure 7. Figure 7: Weighted gated feature fusion. Different fusion strategies are applied for CLIP visual encoders with different backbones. each occluder image, setting non-occluded regions to transparent to ensure correct spatial semantics. The occlusion instance is then resized to match the pedestrian image, its boundaries are feathered to reduce sharp edges, and finally overlaid. Some examples of the occlusion instance i… view at source ↗
Figure 8
Figure 8. Figure 8: Grad-CAM visualization. Left: ResNet50-based CLIP-ReID; Right: CLIP-ReID with LSNet. generate the corresponding occlusion image for the occlusion prompt text template, where each augmentation method adds or simulates occlusion at a 50% ratio. In [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization results of different prompt templates. (a) Template𝑓 𝑢𝑙𝑙 on full-body pedestrian images. (b) Template𝑓 𝑢𝑙𝑙 on occluded pedestrian images. (c) Template𝑜𝑐𝑐 on occluded pedestrian images. (d) Fusion of Template𝑓 𝑢𝑙𝑙 and Template𝑜𝑐𝑐 on occluded pedestrian images [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of occlusion ratios on Occluded-Duke and Occluded-REID. Impact of the number of learnable tokens N and M. We follow CLIP-ReID and set the number of learnable tokens in Template𝑓 𝑢𝑙𝑙 to M=4. We further investigate the effects of different numbers of M and N in Template𝑜𝑐𝑐 on the Occluded-Duke. As shown in [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visualization of ViT-based ReID. (a) original image, (b) CLIP-ReID, and (c) DPL-ReID. The left and right images are respectively test samples from Occluded-Duke and Occluded-REID. under occlusion, leading to consistent performance improvements. Therefore, the overhead is well controlled with a favorable trade-off between efficiency and performance. 4.5. Visualization Attention map visualization. We employ… view at source ↗
Figure 12
Figure 12. Figure 12: Visualization of occlusion effects. The top row shows the original images, while the bottom row shows the occlusion images generated by our proposed RWOA. Query TOP-1 TOP-10 Query TOP-1 TOP-10 [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Retrieval results of baseline CLIP-ReID (the first row) and ours DPL-ReID (the second row). The top row shows the baseline CLIP-ReID, while the bottom row shows DPL-ReID. The incorrectly identified samples are highlighted in red, while the green color highlights the correct samples. 5. Conclusion In this work, we revisit the role of occluders in occluded person ReID and propose a novel method, DPL-ReID. W… view at source ↗
read the original abstract

Occluded person re-identification focuses on matching partially visible pedestrians across multiple camera views. However, occlusions disrupt body-region cues, thereby complicating cross-view matching. Most person ReID methods built on pretrained vision-language models only focus on enhancing prompt-based feature learning while ignoring the semantic information of occluders. Based on the success of CLIP-ReID, we propose a novel Dual Prompt Learning ReID (DPL-ReID) model for occluded person ReID. It incorporates a Dual Prompt Learning (Dual-PL) strategy, which can utilize textual cues to capture complete pedestrian semantics and keep robustness against occlusion, and a Real-World Occlusion Augmentation (RWOA) method that realistically simulates occlusion scenarios encountered in real word to enrich occluded samples. In addition, we also design a Weighted Gated Feature Fusion (WGFF) method, which in corporates LSNet to capture global information and act as a feature-gating mechanism. This mechanism can effectively guide the CLIP visual encoder toward generating more comprehensive feature representations. Extensive experiments on several benchmark occluded ReID datasets show that our proposed DPL-ReID achieves the state-of-the art performance. The occlusion instance library are available at https://github.com/stone-qiao/DPL-ReID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DPL-ReID, a CLIP-based model for occluded person re-identification. It introduces Dual Prompt Learning (Dual-PL) to leverage textual cues for complete pedestrian semantics robust to occlusion, Real-World Occlusion Augmentation (RWOA) to simulate and enrich occluded training samples from an instance library, and Weighted Gated Feature Fusion (WGFF) that uses LSNet to capture global information and gate features for more comprehensive CLIP visual encoder representations. The paper reports state-of-the-art results on several occluded ReID benchmarks and releases the occlusion library.

Significance. If the central performance claims hold after addressing validation gaps, the work would advance occluded ReID by combining vision-language prompting with targeted augmentation and hybrid encoding. The provision of the occlusion instance library and the focus on semantic handling of occluders represent concrete contributions that could support reproducibility and further research in robust cross-view matching.

major comments (2)
  1. [§3.2] §3.2 (RWOA description): The assertion that RWOA 'realistically simulates occlusion scenarios encountered in real world' is load-bearing for the SOTA claim, yet no quantitative validation is supplied (e.g., histograms of occlusion area ratios, spatial location heatmaps, or frequency of occlusion types) against the target benchmarks such as Occluded-Duke or Occluded-ReID. Without such comparison, performance gains may reflect dataset-specific fitting rather than improved generalization.
  2. [Experiments] Experiments section (tables reporting Rank-1/mAP): The abstract and method claims assert SOTA, but the manuscript must include explicit baseline comparisons, ablation studies isolating Dual-PL, RWOA, and WGFF, and error analysis on failure cases to substantiate the central performance result.
minor comments (2)
  1. [Abstract] Abstract: 'in corporates' should be 'incorporates'; 'real word' should be 'real world'; 'The occlusion instance library are available' should be 'is available'.
  2. [§3.3] Notation: Define the weighting function and gating operation in WGFF explicitly (e.g., the precise form of the LSNet output used as gate) to avoid ambiguity in the feature fusion equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing honest responses and indicating where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (RWOA description): The assertion that RWOA 'realistically simulates occlusion scenarios encountered in real world' is load-bearing for the SOTA claim, yet no quantitative validation is supplied (e.g., histograms of occlusion area ratios, spatial location heatmaps, or frequency of occlusion types) against the target benchmarks such as Occluded-Duke or Occluded-ReID. Without such comparison, performance gains may reflect dataset-specific fitting rather than improved generalization.

    Authors: We acknowledge the referee's point that quantitative validation would provide stronger support for the realism claim of RWOA. The method constructs occlusions using an instance library drawn from real-world images, but we agree that direct distributional comparisons to the benchmarks are valuable. In the revised manuscript, we will add a new analysis subsection under §3.2 (or as supplementary material) that includes histograms of occlusion area ratios, spatial location heatmaps, and frequency counts of occlusion types (e.g., upper-body, lower-body, side occlusions), directly comparing RWOA-augmented samples against the statistics of Occluded-Duke and Occluded-ReID. This will help demonstrate that performance improvements stem from better alignment with real occlusion distributions rather than overfitting. revision: yes

  2. Referee: [Experiments] Experiments section (tables reporting Rank-1/mAP): The abstract and method claims assert SOTA, but the manuscript must include explicit baseline comparisons, ablation studies isolating Dual-PL, RWOA, and WGFF, and error analysis on failure cases to substantiate the central performance result.

    Authors: The current experiments section already reports comparisons against multiple state-of-the-art occluded ReID methods (including recent CLIP-based approaches) and includes ablation studies on the proposed components. To more rigorously address the referee's request, we will revise the experiments section to (1) expand the baseline table with additional recent methods for clearer SOTA positioning, (2) present more granular ablation tables that isolate the incremental contribution of Dual-PL, RWOA, and WGFF (including combinations and individual removals), and (3) add a dedicated error analysis subsection that categorizes and discusses representative failure cases across the benchmarks, along with qualitative visualizations. These changes will be incorporated in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent components and external benchmarks

full rationale

The paper presents DPL-ReID as a composite of three explicitly introduced modules (Dual-PL, RWOA, WGFF) built on top of a pretrained CLIP backbone and evaluated on standard occluded ReID benchmarks. No equations, parameter-fitting steps, or derivation chains appear in the provided text that would reduce a claimed result to its own inputs by construction. The sole citation to prior CLIP-ReID work is used only to motivate the starting point, not to justify a uniqueness theorem or load-bearing assumption. Performance claims rest on reported experimental outcomes rather than any self-referential prediction or renaming of known patterns. This is the normal, non-circular case for an applied CV architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that pretrained CLIP features plus textual prompts can recover complete pedestrian semantics under occlusion and that the proposed augmentation produces training distributions representative of real-world occlusions.

axioms (1)
  • domain assumption Pretrained CLIP visual and text encoders supply transferable features suitable for occluded ReID when augmented with prompts and gating
    The method is explicitly built on the success of CLIP-ReID.

pith-pipeline@v0.9.0 · 5759 in / 1239 out tokens · 49513 ms · 2026-05-20T06:32:01.353349+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 3 internal anchors

  1. [1]

    Improved Regularization of Convolutional Neural Networks with Cutout

    Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 . Dosovitskiy, A.,

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 . Fu,D.,Chen,D.,Bao,J.,Yang,H.,Yuan,L.,Zhang,L.,Li,H.,Chen,D.,2021.Unsupervisedpre-trainingforpersonre-identification,in:Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14750–14759. Gao, S., Wang, J., Lu, H., Liu, Z.,

  3. [3]

    11744–11752

    Pose-guided visible part matching for occluded person reid, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11744–11752. Han,S.,Liu,D.,Zhang,Z.,Ming,D.,2023. Spatialcomplementaryandself-repairlearningforoccludedpersonre-identification. Neurocomputing 546, 126360. He, K., Gkioxari, G., Dollár, P., Girshick, R.,

  4. [4]

    2961–2969

    Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. He,K.,Zhang,X.,Ren,S.,Sun,J.,2016. Deepresiduallearningforimagerecognition,in:ProceedingsoftheIEEEconferenceoncomputervision and pattern recognition, pp. 770–778. He, S., Chen, W., Wang, K., Luo, H., Wang, F., Jiang, W., Ding, H.,

  5. [5]

    IEEE transactions on information forensics and security 19, 120–132

    Region generation and assessment network for occluded person re- identification. IEEE transactions on information forensics and security 19, 120–132. He,S.,Luo,H.,Wang,P.,Wang,F.,Li,H.,Jiang,W.,2021. Transreid:Transformer-basedobjectre-identification,in:ProceedingsoftheIEEE/CVF international conference on computer vision, pp. 15013–15022. Huang, K., Azfar...

  6. [6]

    arXiv preprint arXiv:2509.03032

    Background matters too: A language-enhanced adversarial framework for person re-identification. arXiv preprint arXiv:2509.03032 . Huang, X., Belongie, S.J.,

  7. [7]

    2017 IEEE International Conference on Computer Vision (ICCV) , 1510–1519URL:https://api.semanticscholar.org/CorpusID:6576859

    Arbitrary style transfer in real-time with adaptive instance normalization. 2017 IEEE International Conference on Computer Vision (ICCV) , 1510–1519URL:https://api.semanticscholar.org/CorpusID:6576859. Ji,Z.,Cheng,D.,Feng,K.,2025. Exploringstrongertransformerrepresentationlearningforoccludedpersonre-identification. MultimediaSystems 31,

  8. [8]

    Jia,M.,Cheng,X.,Zhai,Y.,Lu,S.,Ma,S.,Tian,Y.,Zhang,J.,2021

    URL:https://doi.org/10.1007/s00530-025-01986-0, doi:10.1007/s00530-025-01986-0. Jia,M.,Cheng,X.,Zhai,Y.,Lu,S.,Ma,S.,Tian,Y.,Zhang,J.,2021. Matchingonsets:Conqueroccludedpersonre-identificationwithoutalignment, in: Proceedings of the AAAI conference on artificial intelligence, pp. 1673–1681. Kumar Singh, K., Jae Lee, Y.,

  9. [9]

    3524–3533

    Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization, in: Proceedings of the IEEE international conference on computer vision, pp. 3524–3533. Li,S.,Sun,L.,Li,Q.,2023. Clip-reid:exploitingvision-languagemodelforimagere-identificationwithoutconcretetextlabels,in:Proceedingsof the AAAI conference on artificia...

  10. [10]

    Microsoft coco: Common objects in context, in: European conference on computer vision, Springer. pp. 740–755. Liu,Z.,Lin,Y.,Cao,Y.,Hu,H.,Wei,Y.,Zhang,Z.,Lin,S.,Guo,B.,2021. Swintransformer:Hierarchicalvisiontransformerusingshiftedwindows, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022. Miao, J., Wu, Y., Liu, P...

  11. [11]

    Pose-guided feature alignment for occluded person re-identification, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 542–551. Radford,A.,Kim,J.W.,Hallacy,C.,Ramesh,A.,Goh,G.,Agarwal,S.,Sastry,G.,Askell,A.,Mishkin,P.,Clark,J.,etal.,2021.Learningtransferable visual models from natural language supervision, in: International ...

  12. [12]

    International journal of computer vision 115, 211–252

    Imagenet large scale visual recognition challenge. International journal of computer vision 115, 211–252. Selvaraju,R.R.,Cogswell,M.,Das,A.,Vedantam,R.,Parikh,D.,Batra,D.,2017. Grad-cam:Visualexplanationsfromdeepnetworksviagradient- based localization, in: Proceedings of the IEEE international conference on computer vision, pp. 618–626. Song, Y., Liu, S.,

  13. [13]

    arXiv preprint arXiv:2401.07469

    A deep hierarchical feature sparse framework for occluded person re-identification. arXiv preprint arXiv:2401.07469 . :Preprint submitted to Elsevier Page 16 of 17 Sun, H., Chen, Z., Yan, S., Xu, L., 2019a. Mvp matching: A maximum-value perfect matching for mining hard samples, with application to person re-identification.2019IEEE/CVFInternationalConferen...

  14. [14]

    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 7127–7136URL:https://api.semanticscholar.org/CorpusID:195823196

    Aanet: Attribute attention network for person re-identifications. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 7127–7136URL:https://api.semanticscholar.org/CorpusID:195823196. Wang,A.,Chen,H.,Lin,Z.,Han,J.,Ding,G.,2025a. Lsnet:Seelarge,focussmall,in:ProceedingsoftheComputerVisionandPatternRecognition Conference, pp. 9718–97...

  15. [15]

    IEEE Transactions on Multimedia 26, 8529–8542

    Feature completion transformer for occluded person re-identification. IEEE Transactions on Multimedia 26, 8529–8542. Wang, Z., Zhu, F., Tang, S., Zhao, R., He, L., Song, J., 2022b. Feature erasing and diffusion network for occluded person re-identification, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4754–476...

  16. [16]

    arXiv preprint arXiv:2110.09408

    Hrformer: High-resolution transformer for dense prediction. arXiv preprint arXiv:2110.09408 . Yun,S.,Han,D.,Oh,S.J.,Chun,S.,Choe,J.,Yoo,Y.,2019. Cutmix:Regularizationstrategytotrainstrongclassifierswithlocalizablefeatures,in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6023–6032. Zhang, Z., Han, S., Liu, D., Ming, D.,

  17. [17]

    Zheng, K., Lan, C., Zeng, W., Liu, J., Zhang, Z., Zha, Z.J.,

    URL:https://doi.org/10.1016/j.neucom.2024.127442, doi:10.1016/j.neucom.2024.127442. Zheng, K., Lan, C., Zeng, W., Liu, J., Zhang, Z., Zha, Z.J.,

  18. [18]

    Person Re-identification Meets Image Search

    Person re-identification meets image search. arXiv preprint arXiv:1502.02171 . Zheng, Z., Zheng, L., Yang, Y.,

  19. [19]

    arXiv preprint arXiv:2508.04998

    Attribute guidance with inherent pseudo-label for occluded person re-identification. arXiv preprint arXiv:2508.04998 . Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.,

  20. [20]

    Occluded person re-identification, in: 2018 IEEE international conference on multimedia and expo (ICME), IEEE. pp. 1–6. :Preprint submitted to Elsevier Page 17 of 17