arxiv: 1907.09659 · v1 · pith:NWKWLBE6new · submitted 2019-07-23 · 💻 cs.CV

Enhancing the Discriminative Feature Learning for Visible-Thermal Cross-Modality Person Re-Identification

Pith reviewed 2026-05-24 18:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords visible-thermal person re-identification · cross-modality discrepancy · intra-modality variations · skip connection · dual-modality triplet loss · two-stream CNN · discriminative feature learning
The pith

Skip connections for mid-level features plus a dual-modality triplet loss enhance discriminative learning in visible-thermal person re-identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets visible-thermal cross-modality person re-identification, a setting where visible cameras fail at night and thermal images must be matched to visible ones despite large appearance shifts. It introduces two minimal additions inside a two-stream CNN: skip connections that fold mid-level features into the final representation, and a dual-modality triplet loss that penalizes both cross-modality and within-modality distances at the same time. These changes are meant to produce person features that remain stable across modalities while still separating different identities. If the additions work, they would let surveillance systems keep matching people reliably across day and night without heavier models or extra data. Experiments on two public datasets report large gains over prior methods.

Core claim

A two-stream CNN equipped with skip connections that incorporate mid-level features and trained with a dual-modality triplet loss reduces both cross-modality discrepancy and intra-modality variations, yielding person features that are more discriminative and robust for visible-thermal re-identification.

What carries the argument

The argument rests on the EDFL method itself: skip connections that fold mid-level features into the final representation, plus a dual-modality triplet loss, both inside a two-stream CNN.
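As a rough illustration of what "folding mid-level features into the final representation" can mean, the sketch below pools a mid-level and a high-level feature map and concatenates them. The fusion rule (global average pooling, per-stage L2 normalization, concatenation) and the names `global_avg_pool`, `l2_normalize`, and `fuse_with_skip` are assumptions made for this example; the summary above does not state how the paper actually fuses the features.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # channels x height x width


def global_avg_pool(fmap: FeatureMap) -> List[float]:
    """Average each channel over its spatial grid: one value per channel."""
    pooled = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def l2_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / (norm + eps) for v in vec]


def fuse_with_skip(mid_fmap: FeatureMap, high_fmap: FeatureMap) -> List[float]:
    """Hypothetical skip-connection fusion: pool both stages, normalize each
    so neither dominates by scale, then concatenate into one person feature."""
    mid = l2_normalize(global_avg_pool(mid_fmap))
    high = l2_normalize(global_avg_pool(high_fmap))
    return mid + high


# Toy maps: 2 mid-level channels on a 2x2 grid, 3 high-level channels on a 1x1 grid.
mid_fmap = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
high_fmap = [[[3.0]], [[4.0]], [[0.0]]]
feature = fuse_with_skip(mid_fmap, high_fmap)
print(len(feature), feature)
```

Normalizing each stage before concatenation is one simple way to keep a wide high-level vector from swamping the mid-level part; other fusion rules (summation, learned projection) would fit the same description.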

If this is right

  • Mid-level features passed via skip connections add robustness that high-level features alone do not provide across modalities.
  • The dual-modality triplet loss simultaneously shrinks distances between visible and thermal images of the same person and expands distances between different people within each modality.
  • A two-stream architecture produces shared features usable by both visible and thermal inputs.
  • The combined changes produce measurable accuracy lifts on existing visible-thermal benchmarks.
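The loss described in the second point can be sketched as the sum of two batch-hard triplet terms, one drawing positives and negatives from the other modality and one drawing them from the anchor's own modality. This is one plausible reading, not the paper's stated formula: the batch-hard mining, the margin of 0.3, the weights `w_cross` and `w_intra`, and all function names are assumptions introduced for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int, str]  # (embedding, person id, modality)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def batch_hard_term(batch: List[Sample], same_modality: bool, margin: float) -> float:
    """Mean hinge loss over anchors, using the hardest positive and hardest
    negative taken from the anchor's own modality (same_modality=True)
    or from the other modality (same_modality=False)."""
    losses = []
    for i, (emb_a, pid_a, mod_a) in enumerate(batch):
        pos, neg = [], []
        for j, (emb_b, pid_b, mod_b) in enumerate(batch):
            if i == j or (mod_a == mod_b) != same_modality:
                continue
            (pos if pid_a == pid_b else neg).append(euclidean(emb_a, emb_b))
        if pos and neg:
            losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0


def dual_modality_triplet_loss(batch: List[Sample], margin: float = 0.3,
                               w_cross: float = 1.0, w_intra: float = 1.0) -> float:
    """Assumed form: weighted sum of a cross-modality and an intra-modality term."""
    cross = batch_hard_term(batch, same_modality=False, margin=margin)
    intra = batch_hard_term(batch, same_modality=True, margin=margin)
    return w_cross * cross + w_intra * intra


def make_batch(thermal_shift: float) -> List[Sample]:
    """Toy batch: 2 identities, 2 visible and 2 thermal embeddings each.
    thermal_shift moves every thermal embedding along x to mimic a modality gap."""
    batch: List[Sample] = []
    for pid, (cx, cy) in enumerate([(0.0, 0.0), (5.0, 5.0)]):
        batch.append(((cx, cy), pid, "visible"))
        batch.append(((cx + 0.1, cy), pid, "visible"))
        batch.append(((cx + thermal_shift, cy + 0.1), pid, "thermal"))
        batch.append(((cx + 0.1 + thermal_shift, cy + 0.1), pid, "thermal"))
    return batch


separated = make_batch(0.0)   # identities well separated, no modality gap
shifted = make_batch(10.0)    # same identities, large modality gap
print(dual_modality_triplet_loss(separated))
print(batch_hard_term(shifted, same_modality=True, margin=0.3))
print(batch_hard_term(shifted, same_modality=False, margin=0.3))
```

In the shifted batch the intra-modality term stays at zero while the cross-modality term becomes positive, which shows why a loss that looks only within each modality would miss the modality gap. Keeping the two weights separate also leaves room to rebalance the terms per dataset.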

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emphasis on mid-level features suggests that identity cues useful across modalities sit at intermediate depths rather than only at the deepest layers.
  • The same pair of modifications could be tested on other cross-modal re-identification pairs such as RGB-infrared or visible-depth without changing the overall training recipe.
  • If the dual triplet loss proves decisive, future work could explore weighting the cross-modality and intra-modality terms separately on different datasets.

Load-bearing premise

The premise is that these two lightweight changes alone can close the modality gap and control intra-modality variation, with no deeper redesign or additional training data.

What would settle it

Running the same network on a new visible-thermal dataset and finding that its accuracy gains shrink to within a few percentage points of the best prior method would show that the enhancements are not generally sufficient.
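Such a comparison needs a concrete accuracy measure. The sketch below uses rank-1 accuracy with thermal queries matched against a visible gallery; the summary does not name the paper's metrics, so the choice of rank-1, the squared Euclidean distance, and the names `sq_dist` and `rank1_accuracy` are assumptions for this example, and the embeddings are made-up toy values.

```python
from typing import List, Sequence, Tuple

Item = Tuple[Sequence[float], int]  # (embedding, person id)


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance; it ranks neighbours like the plain distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank1_accuracy(queries: List[Item], gallery: List[Item]) -> float:
    """Fraction of queries whose nearest gallery embedding has the same id."""
    hits = 0
    for q_emb, q_id in queries:
        nearest = min(gallery, key=lambda g: sq_dist(q_emb, g[0]))
        hits += int(nearest[1] == q_id)
    return hits / len(queries)


# Toy data: visible gallery, thermal queries (values are illustrative only).
visible_gallery = [((0.0, 0.0), 0), ((5.0, 5.0), 1), ((9.0, 0.0), 2)]
thermal_queries = [((0.2, 0.1), 0), ((4.8, 5.3), 1), ((5.5, 4.0), 2)]
accuracy = rank1_accuracy(thermal_queries, visible_gallery)
print(accuracy)
```

The third query lies closer to identity 1 than to its own identity 2, so two of the three queries are matched correctly.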

Figures

Figures reproduced from arXiv: 1907.09659 by Haijun Liu, Jian Cheng.

Figure 1: Illustration of the visible-thermal images, from two datasets SYSU
Figure 2: The location visualization of different stages of ResNet50 [7] model
Figure 3: The proposed EDFL framework for VT Re-ID. Two-stream CNN structure is adopted to extract person features, one stream for visible images and
Figure 5: The performances of our proposed enhancing discriminative feature
Abstract

Existing person re-identification has achieved great progress in the visible domain, where all person images are captured with visible cameras. However, in a 24-hour intelligent surveillance system, visible cameras may be ineffective at night. In this situation, thermal cameras are the best supplemental components, since they capture images without depending on visible light. Therefore, in this paper, we investigate the visible-thermal cross-modality person re-identification (VT Re-ID) problem. In VT Re-ID, two knotty problems must be handled well: cross-modality discrepancy and intra-modality variations. To address these two issues, we propose enhancing the discriminative feature learning (EDFL) with two extremely simple means from two core aspects: (1) skip-connection for mid-level feature incorporation, to improve the person features with more discriminability and robustness, and (2) dual-modality triplet loss, to guide the training procedure by simultaneously considering the cross-modality discrepancy and intra-modality variations. Additionally, the two-stream CNN structure is adopted to learn the multi-modality sharable person features. The experimental results on two datasets show that our proposed EDFL approach distinctly outperforms state-of-the-art methods by large margins, demonstrating the effectiveness of our EDFL to enhance the discriminative feature learning for VT Re-ID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes an EDFL method for visible-thermal cross-modality person re-identification that adopts a two-stream CNN backbone, adds skip-connections to incorporate mid-level features for improved discriminability, and introduces a dual-modality triplet loss to jointly address cross-modality discrepancy and intra-modality variations. The central claim, stated in the abstract and §4, is that these two simple enhancements enable the method to distinctly outperform state-of-the-art approaches by large margins on two datasets.

Significance. If the reported gains are reproducible and attributable to the proposed components rather than the backbone or training protocol, the work would be significant for 24-hour surveillance applications, as it suggests that lightweight architectural and loss modifications can mitigate modality gaps without complex models or extra data.

major comments (1)
  1. [§4] §4 (Experiments) and associated tables: no ablation studies are presented that isolate the contribution of the mid-level skip-connection or the dual-modality triplet loss (e.g., full EDFL vs. two-stream baseline with standard triplet loss). Without these controls, the claim that the large margins over SOTA are caused by the two proposed enhancements cannot be verified and remains load-bearing for the paper's central assertion.
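The controls requested in this comment amount to a small 2×2 grid: skip connections on or off, crossed with a standard or a dual-modality triplet loss. The sketch below only enumerates those configurations; the option names and the `ablation_grid` helper are hypothetical and do not come from the paper.

```python
from itertools import product
from typing import Dict, List, Union

SKIP_OPTIONS = (False, True)
LOSS_OPTIONS = ("standard_triplet", "dual_modality_triplet")


def ablation_grid() -> List[Dict[str, Union[str, bool]]]:
    """Enumerate every combination of the two proposed components."""
    configs = []
    for use_skip, loss in product(SKIP_OPTIONS, LOSS_OPTIONS):
        name = ("skip" if use_skip else "no_skip") + "+" + loss
        configs.append({"name": name, "use_skip": use_skip, "loss": loss})
    return configs


configs = ablation_grid()
for cfg in configs:
    print(cfg["name"])
```

The first configuration is the two-stream baseline with a standard triplet loss and the last is the full method, so the two middle rows isolate each component on its own.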

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the major comment below and will incorporate the suggested changes in the revised manuscript.

  1. Referee: [§4] §4 (Experiments) and associated tables: no ablation studies are presented that isolate the contribution of the mid-level skip-connection or the dual-modality triplet loss (e.g., full EDFL vs. two-stream baseline with standard triplet loss). Without these controls, the claim that the large margins over SOTA are caused by the two proposed enhancements cannot be verified and remains load-bearing for the paper's central assertion.

    Authors: We agree that the manuscript would benefit from explicit ablation studies to isolate the contributions of the mid-level skip-connections and the dual-modality triplet loss. The current experiments focus on overall performance against state-of-the-art methods but do not include direct controls such as a two-stream baseline with standard triplet loss or variants without skip-connections. In the revision, we will add these ablation experiments to the tables in §4, which will allow verification that the reported gains are attributable to the proposed components rather than the backbone or training protocol alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes an empirical method (two-stream CNN + mid-level skip connections + dual-modality triplet loss) for VT Re-ID and supports its claims solely via reported performance on two external datasets. No equations, derivations, or predictions are present that reduce to inputs by construction; no self-citations are invoked as load-bearing uniqueness theorems; and the approach does not rename known results or smuggle ansatzes. The derivation chain is therefore self-contained empirical evaluation rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; standard deep learning assumptions apply but are not detailed.

pith-pipeline@v0.9.0 · 5764 in / 1125 out tokens · 28896 ms · 2026-05-24T18:09:08.515496+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

  1. [1]

    Collective deep quantization for efficient cross-modal retrieval,

    Y. Cao, M. Long, J. Wang, and S. Liu, “Collective deep quantization for efficient cross-modal retrieval,” in AAAI, 2017

  2. [2]

    Multi-level factorisation net for person re-identification,

    X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level factorisation net for person re-identification,” in CVPR, 2018

  3. [3]

    Person re-identification by camera correlation aware feature augmentation,

    Y.-C. Chen, X. Zhu, W.-S. Zheng, and J.-H. Lai, “Person re-identification by camera correlation aware feature augmentation,” IEEE TPAMI, vol. 40, no. 2, pp. 392–408, 2018

  4. [4]

    Towards cycle-consistent models for text and image retrieval,

    M. Cornia, L. Baraldi, H. R. Tavakoli, and R. Cucchiara, “Towards cycle-consistent models for text and image retrieval,” in ECCV, 2018, pp. 687–691

  5. [5]

    Cross-modality person re-identification with generative adversarial training

    P. Dai, R. Ji, H. Wang, Q. Wu, and Y. Huang, “Cross-modality person re-identification with generative adversarial training,” in IJCAI, 2018, pp. 677–683

  6. [6]

    Mutual component convolutional neural networks for heterogeneous face recognition,

    Z. Deng, X. Peng, Z. Li, and Y. Qiao, “Mutual component convolutional neural networks for heterogeneous face recognition,” IEEE TIP, 2019

  7. [7]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778

  8. [8]

    Learning invariant deep representation for NIR-VIS face recognition,

    R. He, X. Wu, Z. Sun, and T. Tan, “Learning invariant deep representation for NIR-VIS face recognition,” in AAAI, 2017

  9. [9]

    In Defense of the Triplet Loss for Person Re-Identification

    A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017

  10. [10]

    A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets,

    S. Karanam, M. Gou, Z. Wu, A. Rates-Borras, O. Camps, and R. J. Radke, “A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets,” IEEE TPAMI, 2018

  11. [11]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  12. [12]

    Imagenet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012, pp. 1097–1105

  13. [13]

    Harmonious attention network for person re-identification,

    W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in CVPR, 2018, pp. 2285–2294

  14. [14]

    Gallery based k-reciprocal-like re-ranking for heavy cross-camera discrepancy in person re-identification,

    H. Liu and J. Cheng, “Gallery based k-reciprocal-like re-ranking for heavy cross-camera discrepancy in person re-identification,” Neurocomputing, vol. 333, pp. 64–75, 2019

  15. [15]

    Hydraplus-net: Attentive deep features for pedestrian analysis,

    X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang, “Hydraplus-net: Attentive deep features for pedestrian analysis,” in ICCV, 2017, pp. 350–359

  16. [16]

    Person recognition system based on a combination of body images from visible light and thermal cameras,

    D. Nguyen, H. Hong, K. Kim, and K. Park, “Person recognition system based on a combination of body images from visible light and thermal cameras,” Sensors, vol. 17, no. 3, p. 605, 2017

  17. [17]

    Deep heterogeneous face recognition networks based on cross-modal distillation and an equitable distance metric,

    C. Reale, H. Lee, and H. Kwon, “Deep heterogeneous face recognition networks based on cross-modal distillation and an equitable distance metric,” in ICCV Workshops, 2017, pp. 32–38

  18. [18]

    Facenet: A unified embedding for face recognition and clustering,

    F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823

  19. [19]

    Grad-cam: Visual explanations from deep networks via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626

  20. [20]

    Pose-driven deep convolutional model for person re-identification,

    C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Pose-driven deep convolutional model for person re-identification,” in ICCV, 2017, pp. 3980–3989

  21. [21]

    Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),

    Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in ECCV, 2018, pp. 501–518

  22. [22]

    Inception-v4, inception-resnet and the impact of residual connections on learning

    C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI, vol. 4, 2017, p. 12

  23. [23]

    Mancs: A multi-task attentional network with curriculum sampling for person re-identification,

    C. Wang, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Mancs: A multi-task attentional network with curriculum sampling for person re-identification,” in ECCV, 2018, pp. 384–400

  24. [24]

    Learning discriminative features with multiple granularities for person re-identification,

    G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning discriminative features with multiple granularities for person re-identification,” in ACM MM, 2018

  25. [25]

    Learning two-branch neural networks for image-text matching tasks,

    L. Wang, Y. Li, J. Huang, and S. Lazebnik, “Learning two-branch neural networks for image-text matching tasks,” IEEE TPAMI, vol. 41, no. 2, pp. 394–407, 2019

  26. [26]

    Learning to reduce dual-level discrepancy for infrared-visible person re-identification,

    Z. Wang, Z. Wang, Y. Zheng, Y.-Y. Chuang, and S. Satoh, “Learning to reduce dual-level discrepancy for infrared-visible person re-identification,” in CVPR, 2019, pp. 618–626

  27. [27]

    Rgb-infrared cross-modality person re-identification,

    A. Wu, W.-S. Zheng, H.-X. Yu, S. Gong, and J. Lai, “Rgb-infrared cross-modality person re-identification,” in ICCV, 2017, pp. 5380–5389

  28. [28]

    Coupled deep learning for heterogeneous face recognition,

    X. Wu, L. Song, R. He, and T. Tan, “Coupled deep learning for heterogeneous face recognition,” in AAAI, 2018

  29. [29]

    Hierarchical discriminative learning for visible thermal person re-identification,

    M. Ye, X. Lan, J. Li, and P. C. Yuen, “Hierarchical discriminative learning for visible thermal person re-identification,” in AAAI, 2018

  30. [30]

    Visible thermal person re-identification via dual-constrained top-ranking

    M. Ye, Z. Wang, X. Lan, and P. C. Yuen, “Visible thermal person re-identification via dual-constrained top-ranking,” in IJCAI, 2018, pp. 1092–1099

  31. [31]

    Visualizing and understanding convolutional networks,

    M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014, pp. 818–833

  32. [32]

    Person Re-identification: Past, Present and Future

    L. Zheng, Y. Yang, and A. G. Hauptmann, “Person re-identification: Past, present and future,” arXiv preprint arXiv:1610.02984, 2016