VRSTC: Occlusion-Free Video Person Re-Identification

Bingpeng Ma; Hong Chang; Ruibing Hou; Shiguang Shan; Xilin Chen; Xinqian Gu

arxiv: 1907.08427 · v1 · pith:WXM632A5new · submitted 2019-07-19 · 💻 cs.CV

Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, Xilin Chen

Pith reviewed 2026-05-24 19:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords video person re-identification · partial occlusion · spatio-temporal completion · STCnet · appearance recovery · surveillance · pedestrian matching · occlusion handling

The pith

STCnet recovers occluded pedestrian appearances using spatial structure and temporal patterns to improve video re-identification accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Spatio-Temporal Completion network to address partial occlusion in video person re-identification. It recovers the appearance of occluded body parts by using the spatial layout within each frame to infer missing sections from visible ones and the sequence's temporal patterns to generate plausible content. This allows the system to use the full completed person appearance rather than ignoring occluded frames. The completed videos are then fed into a standard re-ID network to form the VRSTC framework. If successful, this leads to higher accuracy on datasets where occlusion is common in surveillance videos.

Core claim

The authors claim that a network called STCnet can explicitly recover the appearance of occluded parts in video sequences by combining spatial prediction within frames and temporal generation across frames, and that integrating this with a re-ID network produces a framework robust to partial occlusion that outperforms previous methods on three video re-ID databases.

What carries the argument

Spatio-Temporal Completion network (STCnet) that predicts occluded body parts from unoccluded parts using spatial structure in frames and temporal patterns in sequences.
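The temporal half of this idea can be pictured with a small hand-written stand-in. The sketch below is not STCnet, which is a learned network whose details are not given on this page. It only illustrates the principle of borrowing an occluded region from other frames of the same sequence, weighted by how similar those frames look where both are visible. The region layout, the dot-product similarity, and the names `complete_sequence`, `frames`, and `occluded` are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def complete_sequence(frames: List[List[Vector]],
                      occluded: List[List[bool]]) -> List[List[Vector]]:
    """Fill occluded regions from the same region in other frames.

    frames[t][r] is a descriptor of region r in frame t (hypothetical layout).
    occluded[t][r] is True where that region is hidden.
    The input is not modified; a completed copy is returned.
    """
    completed = [[list(v) for v in frame] for frame in frames]
    for t, frame in enumerate(frames):
        for r in range(len(frame)):
            if not occluded[t][r]:
                continue
            # Candidate source frames: region r must be visible there.
            sources = [s for s in range(len(frames))
                       if s != t and not occluded[s][r]]
            if not sources:
                continue  # nothing to borrow from; leave the region as it is
            # Score each source by similarity over regions visible in both.
            scores = []
            for s in sources:
                shared = [k for k in range(len(frame))
                          if not occluded[t][k] and not occluded[s][k]]
                scores.append(sum(dot(frames[t][k], frames[s][k])
                                  for k in shared))
            weights = softmax(scores)
            # Attention-weighted average of region r across source frames.
            completed[t][r] = [
                sum(w * frames[s][r][d] for w, s in zip(weights, sources))
                for d in range(len(frame[r]))
            ]
    return completed
```

A learned completion network would replace the fixed dot-product scores with trained attention and would also use the spatial layout inside each frame, which this sketch ignores.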

If this is right

The VRSTC framework combines STCnet with a re-ID network for occlusion-robust identification.
STCnet enables leveraging both recovered and unoccluded parts for matching.
Performance on three challenging video re-ID databases exceeds state-of-the-art methods.
Discarding occluded frames is avoided in favor of completion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might extend to real-time surveillance systems where occlusion is frequent.
Further improvements could come from combining with advanced generative models for higher quality recovery.
Similar completion techniques could apply to other occluded video tasks like action recognition.
The approach assumes accurate prediction is possible without introducing artifacts that harm re-ID.

Load-bearing premise

The spatial structure of pedestrian frames and temporal patterns in sequences can be used to accurately predict and generate the appearance of occluded body parts.

What would settle it

Observing that re-identification accuracy decreases or stays the same when using the STCnet-recovered frames compared to simply discarding occluded frames on the tested databases.
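Such a comparison amounts to computing the same retrieval metrics under two conditions: once with completed frames and once with occluded frames discarded. The figures on this page mention rank-1 and mAP, so a minimal sketch of those two metrics follows, assuming a precomputed query-by-gallery distance matrix. The function name `rank1_and_map` and the toy data are hypothetical, and the sketch leaves out the same-camera filtering that re-ID benchmarks usually apply.

```python
from typing import List, Sequence, Tuple


def rank1_and_map(dist: Sequence[Sequence[float]],
                  query_ids: Sequence[int],
                  gallery_ids: Sequence[int]) -> Tuple[float, float]:
    """Return (rank-1 accuracy, mean average precision).

    dist[i][j] is the distance between query i and gallery item j.
    Queries with no matching gallery identity are skipped.
    """
    rank1_hits = 0
    average_precisions: List[float] = []
    for q, row in enumerate(dist):
        # Rank gallery items by increasing distance to this query.
        order = sorted(range(len(row)), key=lambda j: row[j])
        matches = [gallery_ids[j] == query_ids[q] for j in order]
        if not any(matches):
            continue
        rank1_hits += int(matches[0])
        # Average precision: mean of the precision at each correct hit.
        hits = 0
        precisions: List[float] = []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    n = len(average_precisions)
    if n == 0:
        raise ValueError("no query has a matching gallery identity")
    return rank1_hits / n, sum(average_precisions) / n


# Toy example: two queries, three gallery items.
toy_dist = [[0.1, 0.5, 0.9],
            [0.2, 0.8, 0.3]]
rank1, mean_ap = rank1_and_map(toy_dist, query_ids=[1, 2], gallery_ids=[1, 2, 2])
print(f"rank-1 = {rank1:.3f}, mAP = {mean_ap:.3f}")
```

Running the function once per condition on the same query and gallery split gives the two pairs of numbers the test calls for.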

Figures

Figures reproduced from arXiv: 1907.08427 by Bingpeng Ma, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen, Xinqian Gu.

**Figure 1.** Overview of STCnet. The spatial structure generator takes the masked frame as input and outputs the generated frame. The …

**Figure 2.** Illustration of the temporal attention layer. For simplic…

**Figure 3.** Pipeline of VRSTC.

4.1. Similarity Scoring. The works [18, 43, 33, 3] use the attention mechanism to locate the occluded frames. These approaches usually construct a subnetwork to predict the weight of each frame in video. However, it is difficult for the subnetwork to automatically assign low weights to the occluded frames, as there is no direct supervision for the weights. Considering the concern above, …
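The attention-based frame weighting that this excerpt attributes to prior work can be sketched as a softmax-weighted average of per-frame features. This is an illustration of the general pattern only, not the similarity scoring that the paper goes on to propose (that text is cut off above). In the prior works the scores would come from a learned subnetwork; here they are passed in by hand, and the name `attention_pool` is hypothetical.

```python
import math
from typing import List


def attention_pool(frame_features: List[List[float]],
                   frame_scores: List[float]) -> List[float]:
    """Pool per-frame features into one video feature with softmax weights."""
    if not frame_features or len(frame_features) != len(frame_scores):
        raise ValueError("need one score per frame and at least one frame")
    # Softmax over the per-frame scores (shifted by the max for stability).
    peak = max(frame_scores)
    exps = [math.exp(s - peak) for s in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frame features, one output value per dimension.
    dim = len(frame_features[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_features))
            for d in range(dim)]


# Third frame stands in for an occluded, corrupted feature.
features = [[1.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
uniform = attention_pool(features, [0.0, 0.0, 0.0])    # occluded frame counts fully
suppressed = attention_pool(features, [0.0, 0.0, -1000.0])  # occluded frame ignored
print(uniform, suppressed)
```

The pooling only helps if the occluded frame actually receives a low score, which is the difficulty the excerpt points out: nothing in the training signal supervises the weights directly.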

**Figure 4.** The rank-1 and mAP on DukeMTMC-VideoReID (a) …

**Figure 5.** Scores of similarity scoring mechanism from one se…

**Figure 6.** Visual examples of STCnet. From top to bottom: (a) …

Original abstract

Video person re-identification (re-ID) plays an important role in surveillance video analysis. However, the performance of video re-ID degenerates severely under partial occlusion. In this paper, we propose a novel network, called Spatio-Temporal Completion network (STCnet), to explicitly handle partial occlusion problem. Different from most previous works that discard the occluded frames, STCnet can recover the appearance of the occluded parts. For one thing, the spatial structure of a pedestrian frame can be used to predict the occluded body parts from the unoccluded body parts of this frame. For another, the temporal patterns of pedestrian sequence provide important clues to generate the contents of occluded parts. With the Spatio-temporal information, STCnet can recover the appearance for the occluded parts, which could be leveraged with those unoccluded parts for more accurate video re-ID. By combining a re-ID network with STCnet, a video re-ID framework robust to partial occlusion (VRSTC) is proposed. Experiments on three challenging video re-ID databases demonstrate that the proposed approach outperforms the state-of-the-art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

STCnet adds an explicit completion step to recover occluded pedestrian parts via spatial structure and temporal patterns, but the abstract gives no numbers or ablations so the actual payoff is hard to judge.

The core move here is replacing the discarding of occluded frames with an explicit spatio-temporal completion network that tries to fill in missing body parts from the visible structure in the same frame plus sequence patterns. That is a clean shift from the usual approach in video re-ID and directly targets a frequent failure mode in surveillance footage. The paper then feeds the completed frames into a standard re-ID pipeline, which is a logical way to test whether the recovery helps downstream accuracy. Experiments are reported on three video re-ID datasets with a claim of beating prior work, so the evaluation design matches the goal. The mechanism itself is stated plainly without circular definitions or hidden fitting.

The main limitation visible from the abstract is the absence of any quantitative results, ablation tables, or error analysis. Without those it is difficult to tell how much the completion module actually contributes versus other architecture choices or training tricks. A secondary question is how well the spatial-plus-temporal prediction holds when occlusion is heavy; the paper would need to show recovered frames or failure cases to make that clear.

This is aimed at people already working on video person re-ID who want a method-level handle on occlusion rather than a broad new direction. If the full paper supplies the missing numbers, ablations, and some qualitative checks, it is worth sending out for review so the community can see whether the recovery step delivers measurable value.

Referee Report

1 major / 0 minor

Summary. The paper proposes a Spatio-Temporal Completion network (STCnet) to explicitly recover the appearance of occluded body parts in video person re-identification. It leverages intra-frame spatial structure to predict occluded parts from unoccluded ones and inter-frame temporal patterns for content generation. STCnet is combined with a standard re-ID network to form the VRSTC framework, which is reported to outperform prior methods on three video re-ID datasets.

Significance. If the empirical gains hold under rigorous validation, the work would be significant for surveillance applications where partial occlusion is common. Explicitly completing occluded regions via spatio-temporal cues, rather than discarding frames, offers a direct mechanism that could improve robustness; the end-to-end re-ID accuracy evaluation is the appropriate test of the mechanism.

major comments (1)

[Abstract] The claim that the approach 'outperforms the state-of-the-art' on three databases is made without any quantitative results, ablation studies, or error analysis. This is load-bearing for the central claim and prevents verification of whether STCnet's recovery actually improves re-ID accuracy.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern regarding the abstract below and will revise the manuscript accordingly to strengthen the presentation of our central claims.

Referee: [Abstract] The claim that the approach 'outperforms the state-of-the-art' on three databases is made without any quantitative results, ablation studies, or error analysis. This is load-bearing for the central claim and prevents verification of whether STCnet's recovery actually improves re-ID accuracy.

Authors: We agree that the abstract would benefit from including key quantitative results to make the performance claims immediately verifiable. In the revised version, we will update the abstract to report the specific rank-1 and mAP improvements achieved by VRSTC over prior state-of-the-art methods on the three video re-ID datasets (e.g., DukeMTMC-VideoReID, MARS, and iLIDS-VID). The full experimental results, including ablation studies on the contribution of spatio-temporal completion and error analysis, are already detailed in Sections 4 and 5 of the manuscript; adding summary numbers to the abstract will allow readers to assess the impact of STCnet without first reading the experiments section.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper proposes STCnet, a network architecture that explicitly recovers occluded appearance by combining intra-frame spatial structure with inter-frame temporal patterns, then combines it with a re-ID network for VRSTC. The central claim is implemented as a trainable model and validated by accuracy gains on three external datasets. No equations, fitted parameters presented as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text; the mechanism is an architectural ansatz tested empirically rather than derived from its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, training details, or model specifications; therefore no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 5 internal anchors

[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):24, 2009.

[2] R. M. Bolle, J. H. Connell, S. Pankanti, N. K. Ratha, and A. W. Senior. The relation between the ROC curve and the CMC. In AUTOID, pages 15–20, 2005.

[3] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, pages 1169–1178, 2018.

[4] D. A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units. arXiv preprint arXiv:1511.07289, 2015.

[5] A. Dehghan, S. M. Assari, and M. Shah. Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In CVPR, pages 4091–4099, 2015.

[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.

[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[8] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4, 2007.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[10] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

[11] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.

[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, pages 2288–2295, 2012.

[14] D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, pages 384–393, 2017.

[15] S. Li, S. Bak, P. Carr, C. Hetang, and X. Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, pages 369–378, 2018.

[16] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith. Learning locally-adaptive decision functions for person verification. In CVPR, pages 3610–3617, 2013.

[17] K. Liu, B. Ma, W. Zhang, and R. Huang. A spatiotemporal appearance representation for video-based pedestrian re-identification. In ICCV, pages 3810–3818, 2015.

[18] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. In CVPR, pages 4694–4703, 2017.

[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.

[20] L. Ma, X. Yang, and D. Tao. Person re-identification over camera networks using multi-task distance metric learning. IEEE Transactions on Image Processing, 23(4):3656–3670,

[21] N. McLaughlin, J. M. del Rincon, and P. C. Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, pages 1325–1334, 2016.

[22] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS workshop, 2017.

[23] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.

[24] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian. Local Fisher discriminant analysis for pedestrian reidentification. In CVPR, pages 3318–3325, 2013.

[25] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[26] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multitarget, multi-camera tracking. In ECCV Workshop, 2016.

[27] G. Song, B. Leng, Y. Liu, C. Hetang, and S. Cai. Region-based quality estimation network for large-scale person re-identification. In AAAI, 2018.

[28] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018.

[29] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.

[30] L. Wu, C. Shen, and A. V. D. Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609,

[31] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, 2018.

[32] Y. Wu, J. Qiu, J. Takamatsu, and T. Ogasawara. Temporal-enhanced convolutional network for person re-identification. In AAAI, 2018.

[33] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, pages 4743–4752, 2017.

[34] J. You, A. Wu, X. Li, and W. Zheng. Top-push video-based person re-identification. In CVPR, pages 1345–1353, 2016.

[35] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[36] J. Zhang, N. Wang, and L. Zhang. Multi-shot pedestrian re-identification via sequential decision making. In CVPR,

[37] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In CVPR, pages 1239–1248, 2016.

[38] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: person re-identification with human body region guided feature decomposition and fusion. In CVPR, pages 1077–1085, 2017.

[39] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, pages 868–884, 2016.

[40] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.

[41] W. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In CVPR, pages 649–656, 2011.

[42] Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, pages 3652–3661, 2017.

[43] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, pages 6776–6785, 2017.
