MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks

Dong Li; Jinhua Wu; Li Zhang; Yunda Sun; Zhulin Zhang

arxiv: 1907.11366 · v1 · pith:H2DTIOYCnew · submitted 2019-07-26 · 💻 cs.CV

Pith reviewed 2026-05-24 16:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords baggage re-identification · multi-view dataset · Siamese network · airport imaging · object re-identification · pose variation · surface material labels · large-scale dataset

The pith

MVB is the first large-scale public dataset for baggage re-identification, with 4519 identities captured via multi-view cameras across two real airports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MVB as a new dataset tailored to baggage re-identification, which differs from person re-identification in its high inter-class similarity and sensitivity to real-world imaging changes. It supplies 4519 baggage identities and 22660 annotated images plus surface material labels, all collected with a multi-view camera setup meant to capture more complete surface information despite pose shifts and occlusions. A merged Siamese network serves as the baseline model whose performance is measured on the data. A sympathetic reader would care because baggage tracking in airports currently lacks large, realistic training resources that reflect actual environmental differences between collection sites.

Core claim

The authors release MVB, the first publicly available large-scale dataset for baggage ReID, containing 4519 identities and 22660 images together with surface material labels. All images come from a specially-designed multi-view camera system deployed in two real airport environments that differ markedly in imaging factors. The system is intended to capture 3D information about the baggage surface as completely as possible despite pose variation and occlusion. The authors further introduce a merged Siamese network baseline and report its evaluation results on the dataset.

What carries the argument

The specially-designed multi-view camera system that captures baggage from multiple angles to assemble more complete surface information despite pose and occlusion.

If this is right

Re-identification models can now be trained and tested on baggage data that includes both inter-class similarity and cross-environment imaging differences.
Surface material labels become available as an auxiliary signal for distinguishing visually similar items.
Benchmarks exist for merged Siamese architectures on this specific object category.
The dataset supports evaluation of methods that must generalize across two distinct real-world capture conditions.
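
One way the material labels could act as an auxiliary signal is sketched below: an appearance distance is increased when the probe and gallery materials disagree, so visually similar bags of a different material drop in the ranking. This is a hypothetical illustration, not the paper's method. The function names, the material names (`hard_plastic`, `fabric`), the distances, and the `mismatch_penalty` value are all assumptions made for the example.

```python
from typing import List, Tuple

# Each gallery entry: (identity, material label, appearance distance to the probe).
# All values below are made up for illustration; none come from MVB.
GalleryEntry = Tuple[str, str, float]


def material_aware_distance(
    appearance_distance: float,
    probe_material: str,
    gallery_material: str,
    mismatch_penalty: float = 0.5,
) -> float:
    """Add a fixed penalty when the two material labels differ."""
    if probe_material != gallery_material:
        return appearance_distance + mismatch_penalty
    return appearance_distance


def rank_gallery(
    probe_material: str,
    gallery: List[GalleryEntry],
    mismatch_penalty: float = 0.5,
) -> List[str]:
    """Return gallery identities sorted by material-aware distance (closest first)."""
    scored = [
        (material_aware_distance(dist, probe_material, material, mismatch_penalty), identity)
        for identity, material, dist in gallery
    ]
    return [identity for _, identity in sorted(scored)]


gallery = [
    ("bag_1", "hard_plastic", 0.30),
    ("bag_2", "fabric", 0.28),  # closest by appearance, but a different material
    ("bag_3", "hard_plastic", 0.45),
]

appearance_only = rank_gallery("hard_plastic", gallery, mismatch_penalty=0.0)
with_material = rank_gallery("hard_plastic", gallery, mismatch_penalty=0.5)
print(appearance_only)
print(with_material)
```

With the penalty set to zero the ranking uses appearance alone, which makes it easy to compare both settings on the same gallery. A fixed additive penalty is only one possible design; a learned weighting would be another.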

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same multi-view capture strategy could be applied to other rigid objects that must be tracked across camera networks.
Material labels might support downstream tasks such as automated sorting by surface type in logistics settings.
Performance gaps between the baseline and future models on MVB would quantify how much the baggage domain still differs from person re-identification.
The two-environment design supplies a ready test bed for domain-adaptation techniques without needing new data collection.

Load-bearing premise

The multi-view camera system actually gathers enough additional surface information to overcome the pose and occlusion problems that occur in the two airport environments.

What would settle it

A controlled experiment in which single-view subsets of MVB yield re-identification accuracy equal to or higher than the full multi-view version would falsify the claim that the multi-view capture is required to handle the stated variations.
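
A minimal sketch of how such a comparison could be scored, assuming some model has already produced a ranked gallery list for every probe under each condition. The function `rank_k_accuracy`, the identity strings, and both toy rankings are illustrative assumptions; none of them are data or code from the paper.

```python
from typing import List, Sequence


def rank_k_accuracy(
    ranked_gallery_ids: Sequence[Sequence[str]],
    probe_ids: Sequence[str],
    k: int,
) -> float:
    """Fraction of probes whose true identity appears among the top-k gallery results."""
    if len(ranked_gallery_ids) != len(probe_ids):
        raise ValueError("one ranking is required per probe")
    if not probe_ids:
        raise ValueError("at least one probe is required")
    hits = sum(
        1
        for true_id, ranking in zip(probe_ids, ranked_gallery_ids)
        if true_id in ranking[:k]
    )
    return hits / len(probe_ids)


# Hypothetical probes and rankings (best match first) for the two conditions.
probe_ids: List[str] = ["bag_a", "bag_b", "bag_c", "bag_d"]

multi_view_rankings = [
    ["bag_a", "bag_b", "bag_c"],
    ["bag_c", "bag_b", "bag_a"],
    ["bag_c", "bag_a", "bag_d"],
    ["bag_a", "bag_d", "bag_b"],
]
single_view_rankings = [
    ["bag_b", "bag_a", "bag_c"],
    ["bag_c", "bag_a", "bag_b"],
    ["bag_c", "bag_a", "bag_d"],
    ["bag_a", "bag_b", "bag_c"],
]

for k in (1, 3):
    multi = rank_k_accuracy(multi_view_rankings, probe_ids, k)
    single = rank_k_accuracy(single_view_rankings, probe_ids, k)
    print(f"rank-{k}: multi-view={multi:.2f} single-view={single:.2f}")
```

In this toy data the multi-view rankings score higher at both cut-offs. The falsifying outcome described above would be the opposite pattern on the real dataset: single-view accuracy equal to or above multi-view accuracy.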

Figures

Figures reproduced from arXiv: 1907.11366 by Dong Li, Jinhua Wu, Li Zhang, Yunda Sun, Zhulin Zhang.

**Figure 1.** Baggage ReID application and multi-view camera system at: (a) checkpoint (b) BHS.

**Figure 2.** Architecture of merged Siamese network.

**Figure 3.** Sample ReID results on MVB. Probe and Gallery images are not masked. Probe images are listed in the left in blue box. Gallery images are displayed in order of inferenced possibility. Gallery images with same identity as probe are bounded in green box, otherwise in red. (a) samples of baggage re-identified in top 3, (b) samples of baggage not re-identified in top 3.

Original abstract

In this paper, we present a novel dataset named MVB (Multi View Baggage) for baggage ReID task which has some essential differences from person ReID. The features of MVB are three-fold. First, MVB is the first publicly released large-scale dataset that contains 4519 baggage identities and 22660 annotated baggage images as well as its surface material labels. Second, all baggage images are captured by specially-designed multi-view camera system to handle pose variation and occlusion, in order to obtain the 3D information of baggage surface as complete as possible. Third, MVB has remarkable inter-class similarity and intra-class dissimilarity, considering the fact that baggage might have very similar appearance while the data is collected in two real airport environments, where imaging factors varies significantly from each other. Moreover, we proposed a merged Siamese network as baseline model and evaluated its performance. Experiments and case study are conducted on MVB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core contribution is the release of MVB, the first large-scale public dataset for baggage re-identification, with multi-view images and material labels from real airports.

The main thing to know is that this paper releases MVB, a dataset of 4519 baggage identities and 22660 images captured with a multi-view system across two airport settings, plus surface material labels. That setup is new compared to standard person ReID benchmarks and directly targets practical baggage tracking needs where pose changes and occlusions matter. The merged Siamese network baseline is a straightforward adaptation to combine views, and the inter-class similarity plus cross-environment variation are presented as deliberate challenges rather than afterthoughts. The collection process itself looks like the real deliverable here.

The abstract gives no numbers on baseline performance, error rates, or ablation results, so the evaluation strength is hard to judge from what's shown. The claim that the multi-view rig captures 3D surface information as completely as possible is stated without supporting details on coverage or failure cases, which leaves some room for skepticism about how well it holds up in practice.

Dataset papers like this are mainly useful to groups working on ReID extensions beyond people or on airport security applications. A reader building or testing models on non-standard objects would get value from the scale and the explicit material annotations. It is worth sending to peer review because the dataset release is concrete and the task differences from person ReID are clearly motivated, even if the baseline section needs more metrics to stand on its own.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce the MVB dataset for baggage re-identification, the first public large-scale collection with 4519 identities and 22660 annotated images plus surface material labels. Images are captured via a specially-designed multi-view camera system across two real airport environments to mitigate pose variation and occlusion. The work also proposes a merged Siamese network baseline and reports experiments and case studies on the dataset.

Significance. Release of a dataset at this scale with explicit multi-view capture and material labels would fill a gap in baggage ReID benchmarks, which differ from person ReID due to high inter-class similarity and environmental variation. The baseline provides an initial reference point for future models.

major comments (2)

[Abstract] The statement that the merged Siamese network 'evaluated its performance' is unsupported by any numerical results, rank-k accuracies, mAP values, or error bars; without these the baseline contribution cannot be assessed.
[Data collection] The assertion that the multi-view system obtains '3D information of baggage surface as complete as possible' lacks quantitative support such as measured surface coverage percentages or occlusion rates across the two airport environments.
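
For readers unfamiliar with the mAP metric named in the first comment, the following is a minimal sketch of mean average precision for a retrieval setting. It assumes each probe comes with a complete ranking of the gallery, so every correct match appears somewhere in the list. The function names and the toy rankings are illustrative assumptions and do not reproduce the paper's evaluation protocol.

```python
from typing import Sequence


def average_precision(ranked_ids: Sequence[str], probe_id: str) -> float:
    """Average of precision values taken at each position where a correct match occurs.

    Assumes ranked_ids is a full gallery ranking, so the number of hits found
    equals the number of correct matches in the gallery.
    """
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == probe_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    rankings: Sequence[Sequence[str]], probe_ids: Sequence[str]
) -> float:
    """Mean of per-probe average precision."""
    if len(rankings) != len(probe_ids) or not probe_ids:
        raise ValueError("need one non-empty ranking per probe")
    scores = [average_precision(r, p) for r, p in zip(rankings, probe_ids)]
    return sum(scores) / len(scores)


# Hypothetical example: identity "a" has two gallery images, "b" has one.
rankings = [
    ["x", "a", "a", "y"],  # correct matches at positions 2 and 3
    ["b", "z"],            # correct match at position 1
]
probe_ids = ["a", "b"]

print(f"mAP = {mean_average_precision(rankings, probe_ids):.4f}")
```

Unlike a top-k hit rate, this score rewards placing every correct gallery image early, which matters when an identity has several gallery views.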

minor comments (1)

Add explicit public release link, license, and download instructions in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will revise the manuscript to improve clarity and support for the claims made.

Referee: [Abstract] The statement that the merged Siamese network 'evaluated its performance' is unsupported by any numerical results, rank-k accuracies, mAP values, or error bars; without these the baseline contribution cannot be assessed.

Authors: We agree that the abstract would be strengthened by including quantitative results. The full manuscript contains experimental results for the merged Siamese network (including rank-k accuracies and mAP), but these were not summarized in the abstract. In the revision we will add the key performance metrics to the abstract.

revision: yes

Referee: [Data collection] The assertion that the multi-view system obtains '3D information of baggage surface as complete as possible' lacks quantitative support such as measured surface coverage percentages or occlusion rates across the two airport environments.

Authors: The phrase was intended to describe the design objective of the multi-view camera rig. We did not perform explicit quantitative measurements of surface coverage or occlusion rates during data collection. We will revise the wording to remove the unsupported quantitative implication while retaining the description of the multi-view capture approach.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central contribution is the release of the MVB dataset with stated scale (4519 identities, 22660 images), capture method, and material labels, plus a baseline merged Siamese network for evaluation. No derivation chain, equations, fitted parameters presented as predictions, or self-citations are invoked to support load-bearing claims. The dataset introduction stands as an independent empirical contribution without reduction to its own inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Contribution centers on empirical data collection rather than mathematical derivation; relies on standard computer vision practices for annotation and evaluation.

axioms (1)

domain assumption: Standard assumptions in computer vision for image annotation, multi-view capture, and ReID evaluation hold for baggage images.
The paper builds on typical CV dataset practices without stating novel axioms.

