
arxiv: 1907.08427 · v1 · pith:WXM632A5new · submitted 2019-07-19 · 💻 cs.CV

VRSTC: Occlusion-Free Video Person Re-Identification

Pith reviewed 2026-05-24 19:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords video person re-identification · partial occlusion · spatio-temporal completion · STCnet · appearance recovery · surveillance · pedestrian matching · occlusion handling

The pith

STCnet recovers occluded pedestrian appearances using spatial structure and temporal patterns to improve video re-identification accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Spatio-Temporal Completion network to address partial occlusion in video person re-identification. It recovers the appearance of occluded body parts by using the spatial layout within each frame to infer missing sections from visible ones and the sequence's temporal patterns to generate plausible content. This allows the system to use the full completed person appearance rather than ignoring occluded frames. The completed videos are then fed into a standard re-ID network to form the VRSTC framework. If successful, this leads to higher accuracy on datasets where occlusion is common in surveillance videos.

Core claim

The authors claim that a network called STCnet can explicitly recover the appearance of occluded parts in video sequences by combining spatial prediction within frames and temporal generation across frames, and that integrating this with a re-ID network produces a framework robust to partial occlusion that outperforms previous methods on three video re-ID databases.

What carries the argument

Spatio-Temporal Completion network (STCnet) that predicts occluded body parts from unoccluded parts using spatial structure in frames and temporal patterns in sequences.
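One way to picture "leveraging recovered and unoccluded parts together" is a mask-based composite: visible pixels are kept from the input frame and only the occluded region is taken from a generator's output. The sketch below is an illustration under assumptions, not the paper's implementation; the function name `composite_completed_frame`, the binary `occlusion_mask`, and the all-ones `generated` array standing in for a network output are all hypothetical.

```python
import numpy as np

def composite_completed_frame(original, generated, occlusion_mask):
    """Keep visible pixels from `original`; take occluded pixels from `generated`.

    original, generated: float arrays of shape (H, W, C)
    occlusion_mask: array of shape (H, W), 1 where occluded, 0 where visible
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    # Add a trailing channel axis so the mask broadcasts over C.
    mask = occlusion_mask.astype(original.dtype)[..., None]
    return mask * generated + (1.0 - mask) * original

# Toy example: a 4x4 RGB frame whose lower half is occluded.
original = np.zeros((4, 4, 3))
generated = np.ones((4, 4, 3))  # stand-in for a generator's output
mask = np.zeros((4, 4))
mask[2:, :] = 1

completed = composite_completed_frame(original, generated, mask)
```

In this toy case the upper half of `completed` is unchanged and the lower half comes entirely from `generated`; how the generated content is actually produced is the part STCnet is meant to learn.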

If this is right

  • The VRSTC framework combines STCnet with a re-ID network for occlusion-robust identification.
  • STCnet enables leveraging both recovered and unoccluded parts for matching.
  • Performance on three challenging video re-ID databases exceeds state-of-the-art methods.
  • Discarding occluded frames is avoided in favor of completion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might extend to real-time surveillance systems where occlusion is frequent.
  • Further improvements could come from combining with advanced generative models for higher quality recovery.
  • Similar completion techniques could apply to other occluded video tasks like action recognition.
  • The approach assumes accurate prediction is possible without introducing artifacts that harm re-ID.

Load-bearing premise

The spatial structure of pedestrian frames and temporal patterns in sequences can be used to accurately predict and generate the appearance of occluded body parts.

What would settle it

The claim would be refuted if re-identification accuracy on the tested databases decreased or stayed the same when using STCnet-recovered frames instead of simply discarding occluded frames.
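Such a comparison comes down to computing the same retrieval metrics for both pipelines. The sketch below computes rank-1 accuracy and mAP from a query-by-gallery distance matrix; it is a simplified illustration, and the function name `rank1_and_map` and the toy data are assumptions. It omits the same-camera gallery filtering that standard re-ID protocols usually apply.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """Rank-1 accuracy and mean average precision.

    dist: (num_query, num_gallery) distances; smaller means more similar.
    Simplified: no same-camera filtering (an assumption of this sketch).
    """
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                  # closest gallery item first
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue                                 # skip queries with no true match
        rank1_hits.append(float(matches[0]))
        hit_positions = np.flatnonzero(matches)      # 0-based ranks of true matches
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

# Toy data: two queries, three gallery items.
dist = np.array([[0.1, 0.9, 0.5],
                 [0.2, 0.8, 0.3]])
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
```

Running this once on distances computed from completed sequences and once on distances computed after discarding occluded frames would give the two numbers the test above asks for.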

Figures

Figures reproduced from arXiv: 1907.08427 by Bingpeng Ma, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen, Xinqian Gu.

Figure 1: Overview of STCnet. The spatial structure generator takes the masked frame as input and outputs the generated frame. The …

Figure 2: Illustration of the temporal attention layer. For simplic…

Figure 3: Pipeline of VRSTC.

Excerpt captured with Figure 3 (Section 4.1, Similarity Scoring): The works [18, 43, 33, 3] use the attention mechanism to locate the occluded frames. These approaches usually construct a subnetwork to predict the weight of each frame in video. However, it is difficult for the subnetwork to automatically assign low weights to the occluded frames, as there is no direct supervision for the weights. Considering the concern above, …

Figure 4: The rank-1 and mAP on DukeMTMC-VideoReID (a) …

Figure 5: Scores of similarity scoring mechanism from one se…

Figure 6: Visual examples of STCnet. From top to bottom: (a) …
Original abstract

Video person re-identification (re-ID) plays an important role in surveillance video analysis. However, the performance of video re-ID degenerates severely under partial occlusion. In this paper, we propose a novel network, called Spatio-Temporal Completion network (STCnet), to explicitly handle partial occlusion problem. Different from most previous works that discard the occluded frames, STCnet can recover the appearance of the occluded parts. For one thing, the spatial structure of a pedestrian frame can be used to predict the occluded body parts from the unoccluded body parts of this frame. For another, the temporal patterns of pedestrian sequence provide important clues to generate the contents of occluded parts. With the Spatio-temporal information, STCnet can recover the appearance for the occluded parts, which could be leveraged with those unoccluded parts for more accurate video re-ID. By combining a re-ID network with STCnet, a video re-ID framework robust to partial occlusion (VRSTC) is proposed. Experiments on three challenging video re-ID databases demonstrate that the proposed approach outperforms the state-of-the-art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a Spatio-Temporal Completion network (STCnet) to explicitly recover the appearance of occluded body parts in video person re-identification. It leverages intra-frame spatial structure to predict occluded parts from unoccluded ones and inter-frame temporal patterns for content generation. The completed frames are fed to a standard re-ID network to form the VRSTC framework, which is reported to outperform prior methods on three video re-ID datasets.

Significance. If the empirical gains hold under rigorous validation, the work would be significant for surveillance applications where partial occlusion is common. Explicitly completing occluded regions via spatio-temporal cues, rather than discarding frames, offers a direct mechanism that could improve robustness; the end-to-end re-ID accuracy evaluation is the appropriate test of the mechanism.

major comments (1)
  1. [Abstract] Abstract: the claim that the approach 'outperforms the state-of-the-art' on three databases is made without any quantitative results, ablation studies, or error analysis; this is load-bearing for the central claim and prevents verification of whether STCnet's recovery actually improves re-ID accuracy.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern regarding the abstract below and will revise the manuscript accordingly to strengthen the presentation of our central claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the approach 'outperforms the state-of-the-art' on three databases is made without any quantitative results, ablation studies, or error analysis; this is load-bearing for the central claim and prevents verification of whether STCnet's recovery actually improves re-ID accuracy.

    Authors: We agree that the abstract would benefit from including key quantitative results to make the performance claims immediately verifiable. In the revised version, we will update the abstract to report the specific rank-1 and mAP improvements achieved by VRSTC over prior state-of-the-art methods on the three video re-ID datasets (e.g., DukeMTMC-VideoReID, MARS, and iLIDS-VID). The full experimental results, including ablation studies on the contribution of spatio-temporal completion and error analysis, are already detailed in Sections 4 and 5 of the manuscript; adding summary numbers to the abstract will allow readers to assess the impact of STCnet without first reading the experiments section.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper proposes STCnet, a network architecture that explicitly recovers occluded appearance by combining intra-frame spatial structure with inter-frame temporal patterns, then combines it with a re-ID network for VRSTC. The central claim is implemented as a trainable model and validated by accuracy gains on three external datasets. No equations, fitted parameters presented as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text; the mechanism is an architectural ansatz tested empirically rather than derived from its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, training details, or model specifications; therefore no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5734 in / 993 out tokens · 20975 ms · 2026-05-24T19:13:17.009896+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 5 internal anchors

  1. [1]

    C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):24, 2009. 2

  2. [2]

    R. M. Bolle, J. H. Connell, S. Pankanti, N. K. Ratha, and A. W. Senior. The relation between the roc curve and the cmc. In AUTOID, pages 15–20, 2005. 5

  3. [3]

    D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, pages 1169–1178, 2018. 1, 5, 7, 8

  4. [4]

    D. A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units. arXiv preprint arXiv:1511.07289, 2015. 3

  5. [5]

    A. Dehghan, S. M. Assari, and M. Shah. Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In CVPR, pages 4091–4099, 2015. 5

  6. [6]

    P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010. 5

  7. [7]

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014. 2

  8. [8]

    J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4, 2007. 2

  9. [9]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 4

  10. [10]

    In Defense of the Triplet Loss for Person Re-Identification

    A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017. 7

  11. [11]

    S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017. 2, 3

  12. [12]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5

  13. [13]

    M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, pages 2288–2295, 2012. 8

  14. [14]

    D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, pages 384–393, 2017. 2

  15. [15]

    S. Li, S. Bak, P. Carr, C. Hetang, and X. Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, pages 369–378, 2018. 1, 2, 7, 8

  16. [16]

    Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith. Learning locally-adaptive decision functions for person verification. In CVPR, pages 3610–3617, 2013. 2, 8

  17. [17]

    K. Liu, B. Ma, W. Zhang, and R. Huang. A spatiotemporal appearance representation for video-based pedestrian re-identification. In ICCV, pages 3810–3818, 2015. 2, 8

  18. [18]

    Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. In CVPR, pages 4694–4703, 2017. 1, 2, 5, 7, 8

  19. [19]

    J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. 3

  20. [20]

    L. Ma, X. Yang, and D. Tao. Person re-identification over camera networks using multi-task distance metric learning. IEEE Transactions on Image Processing, 23(4):3656–3670,

  21. [21]

    N. McLaughlin, J. M. del Rincon, and P. C. Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, pages 1325–1334, 2016. 1, 2, 8

  22. [22]

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS workshop, 2017. 5

  23. [23]

    D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016. 2

  24. [24]

    S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian. Local fisher discriminant analysis for pedestrian reidentification. In CVPR, pages 3318–3325, 2013. 8

  25. [25]

    A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 4

  26. [26]

    E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multitarget, multi-camera tracking. In ECCV Workshop, 2016. 5

  27. [27]

    G. Song, B. Leng, Y. Liu, C. Hetang, and S. Cai. Region-based quality estimation network for large-scale person re-identification. In AAAI, 2018. 2, 7, 8

  28. [28]

    Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018. 2, 4

  29. [29]

    X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018. 5, 6

  30. [30]

    L. Wu, C. Shen, and A. V. D. Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609,

  31. [31]

    Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, 2018. 7, 8

  32. [32]

    Y. Wu, J. Qiu, J. Takamatsu, and T. Ogasawara. Temporal-enhanced convolutional network for person re-identification. In AAAI, 2018. 1, 2

  33. [33]

    S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, pages 4743–4752, 2017. 1, 2, 5, 8

  34. [34]

    J. You, A. Wu, X. Li, and W. Zheng. Top-push video-based person re-identification. In CVPR, pages 1345–1353, 2016. 8

  35. [35]

    Multi-Scale Context Aggregation by Dilated Convolutions

    F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015. 2, 3

  36. [36]

    J. Zhang, N. Wang, and L. Zhang. Multi-shot pedestrian re-identification via sequential decision making. In CVPR,

  37. [37]

    L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In CVPR, pages 1239–1248, 2016. 2

  38. [38]

    H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: person re-identification with human body region guided feature decomposition and fusion. In CVPR, pages 1077–1085, 2017. 2

  39. [39]

    L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. Mars: A video benchmark for large-scale person re-identification. In ECCV, pages 868–884, 2016. 8

  40. [40]

    L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015. 5

  41. [41]

    W. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In CVPR, pages 649–656, 2011. 2

  42. [42]

    Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, pages 3652–3661, 2017. 7

  43. [43]

    Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, pages 6776–6785, 2017. 1, 2, 5, 8