MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks
Pith reviewed 2026-05-24 16:11 UTC · model grok-4.3
The pith
MVB is the first large-scale public dataset for baggage re-identification, with 4519 identities captured via multi-view cameras across two real airports.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors release MVB, the first publicly available large-scale dataset for baggage ReID, containing 4519 identities and 22660 images together with surface material labels. All images come from a specially-designed multi-view camera system deployed in two real airport environments that differ markedly in imaging factors. The system is intended to capture 3D information about the baggage surface as completely as possible despite pose variation and occlusion. The authors further introduce a merged Siamese network baseline and report its evaluation results on the dataset.
What carries the argument
The specially-designed multi-view camera system that captures baggage from multiple angles to assemble more complete surface information despite pose and occlusion.
If this is right
- Re-identification models can now be trained and tested on baggage data that includes both inter-class similarity and cross-environment imaging differences.
- Surface material labels become available as an auxiliary signal for distinguishing visually similar items.
- Benchmarks exist for merged Siamese architectures on this specific object category.
- The dataset supports evaluation of methods that must generalize across two distinct real-world capture conditions.
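For readers unfamiliar with the Siamese idea behind the baseline, the following is a minimal sketch of a generic Siamese pair trained with a contrastive loss. It is not the paper's merged architecture, which this page does not describe; the linear `embed` function, the `weights` matrix, and the `margin` value are illustrative assumptions.

```python
import math

def embed(x, weights):
    """Shared linear embedding: both branches use the same weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def contrastive_loss(x1, x2, same_identity, weights, margin=1.0):
    """Standard contrastive loss on the distance between two embeddings."""
    e1, e2 = embed(x1, weights), embed(x2, weights)
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))
    if same_identity:
        # Matching pairs are penalised for being far apart.
        return d ** 2
    # Non-matching pairs are penalised only while closer than the margin.
    return max(0.0, margin - d) ** 2

# Hypothetical 2-D inputs and identity weights, for illustration only.
identity_weights = [[1.0, 0.0], [0.0, 1.0]]
loss_same = contrastive_loss((0.0, 0.0), (0.6, 0.8), True, identity_weights)
loss_diff = contrastive_loss((0.0, 0.0), (0.6, 0.8), False, identity_weights)
```

The key property is weight sharing: because both inputs pass through the same `embed`, the distance between embeddings can be compared across arbitrary image pairs at test time.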
Where Pith is reading between the lines
- The same multi-view capture strategy could be applied to other rigid objects that must be tracked across camera networks.
- Material labels might support downstream tasks such as automated sorting by surface type in logistics settings.
- Performance gaps between the baseline and future models on MVB would quantify how much the baggage domain still differs from person re-identification.
- The two-environment design supplies a ready test bed for domain-adaptation techniques without needing new data collection.
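A cross-environment test of the kind suggested in the last point could be set up roughly as follows. This is a sketch under assumptions: the record fields `identity`, `env`, and `image`, and the environment names, are hypothetical and do not reflect the actual MVB annotation format, which this page does not specify.

```python
def split_by_environment(records, train_env, test_env):
    """Train on one capture environment, test on the other."""
    train = [r for r in records if r["env"] == train_env]
    test = [r for r in records if r["env"] == test_env]
    # Precaution (an assumption, not a stated MVB protocol): drop test
    # identities that also appear in training to avoid identity leakage.
    train_ids = {r["identity"] for r in train}
    test = [r for r in test if r["identity"] not in train_ids]
    return train, test

# Hypothetical records for illustration only.
records = [
    {"identity": "bag_001", "env": "airport_A", "image": "a1.jpg"},
    {"identity": "bag_001", "env": "airport_B", "image": "b1.jpg"},
    {"identity": "bag_002", "env": "airport_B", "image": "b2.jpg"},
    {"identity": "bag_003", "env": "airport_A", "image": "a3.jpg"},
]
train, test = split_by_environment(records, "airport_A", "airport_B")
```

Whether to drop overlapping identities depends on the question being asked; keeping them would instead measure how well the same bag is matched across the two environments.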
Load-bearing premise
The multi-view camera system actually gathers enough additional surface information to overcome the pose and occlusion problems that occur in the two airport environments.
What would settle it
A controlled experiment in which single-view subsets of MVB yield re-identification accuracy equal to or higher than the full multi-view version would falsify the claim that the multi-view capture is required to handle the stated variations.
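The comparison described above can be sketched on toy data. Everything here is an assumption made for illustration: the 2-D feature vectors, the identity names, and the choice to score an identity by its closest gallery view. None of it comes from the paper.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank1_accuracy(queries, gallery):
    """queries: list of (identity, feature); gallery: identity -> list of view features."""
    hits = 0
    for true_id, q in queries:
        # An identity's distance is the distance to its closest gallery view.
        best_id = min(gallery, key=lambda gid: min(euclidean(q, v) for v in gallery[gid]))
        hits += best_id == true_id
    return hits / len(queries)

def keep_single_view(gallery, view_index=0):
    """Build the single-view ablation by keeping one view per identity."""
    return {gid: [views[view_index]] for gid, views in gallery.items()}

# Hypothetical features: two views per bag.
gallery = {
    "bag_a": [(0.0, 0.0), (2.0, 0.0)],
    "bag_b": [(0.0, 1.0), (2.0, 3.0)],
}
queries = [("bag_a", (1.8, 0.6)), ("bag_b", (0.1, 0.9))]

multi_view_acc = rank1_accuracy(queries, gallery)                     # 1.0 on this toy data
single_view_acc = rank1_accuracy(queries, keep_single_view(gallery))  # 0.5 on this toy data
```

On real data the claim would be in doubt if `single_view_acc` matched or exceeded `multi_view_acc` consistently across views and across both airport environments.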
Original abstract
In this paper, we present a novel dataset named MVB (Multi View Baggage) for the baggage ReID task, which has some essential differences from person ReID. The features of MVB are three-fold. First, MVB is the first publicly released large-scale dataset that contains 4519 baggage identities and 22660 annotated baggage images as well as their surface material labels. Second, all baggage images are captured by a specially-designed multi-view camera system to handle pose variation and occlusion, in order to obtain the 3D information of baggage surface as complete as possible. Third, MVB has remarkable inter-class similarity and intra-class dissimilarity, considering the fact that baggage might have very similar appearance while the data is collected in two real airport environments, where imaging factors vary significantly from each other. Moreover, we proposed a merged Siamese network as baseline model and evaluated its performance. Experiments and case study are conducted on MVB.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce the MVB dataset for baggage re-identification, the first public large-scale collection with 4519 identities and 22660 annotated images plus surface material labels. Images are captured via a specially-designed multi-view camera system across two real airport environments to mitigate pose variation and occlusion. The work also proposes a merged Siamese network baseline and reports experiments and case studies on the dataset.
Significance. Release of a dataset at this scale with explicit multi-view capture and material labels would fill a gap in baggage ReID benchmarks, which differ from person ReID due to high inter-class similarity and environmental variation. The baseline provides an initial reference point for future models.
major comments (2)
- [Abstract] The statement that the authors proposed a merged Siamese network and 'evaluated its performance' is unsupported by any numerical results, rank-k accuracies, mAP values, or error bars; without these the baseline contribution cannot be assessed.
- [Data collection] The assertion that the multi-view system obtains '3D information of baggage surface as complete as possible' lacks quantitative support such as measured surface coverage percentages or occlusion rates across the two airport environments.
minor comments (1)
- Add explicit public release link, license, and download instructions in the camera-ready version.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and will revise the manuscript to improve clarity and support for the claims made.
Point-by-point responses
- Referee: [Abstract] The statement that the authors proposed a merged Siamese network and 'evaluated its performance' is unsupported by any numerical results, rank-k accuracies, mAP values, or error bars; without these the baseline contribution cannot be assessed.
  Authors: We agree that the abstract would be strengthened by including quantitative results. The full manuscript contains experimental results for the merged Siamese network (including rank-k accuracies and mAP), but these were not summarized in the abstract. In the revision we will add the key performance metrics to the abstract.
  Revision: yes
- Referee: [Data collection] The assertion that the multi-view system obtains '3D information of baggage surface as complete as possible' lacks quantitative support such as measured surface coverage percentages or occlusion rates across the two airport environments.
  Authors: The phrase was intended to describe the design objective of the multi-view camera rig. We did not perform explicit quantitative measurements of surface coverage or occlusion rates during data collection. We will revise the wording to remove the unsupported quantitative implication while retaining the description of the multi-view capture approach.
  Revision: yes
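The rank-k accuracy and mAP figures discussed in the first exchange are standard retrieval metrics. The sketch below shows one common way to compute them from a query-by-gallery distance matrix. The toy distances and identity labels are hypothetical, and the sketch omits protocol details (such as excluding same-camera gallery entries) that a real benchmark may require; the MVB protocol is not stated on this page.

```python
def rank_k_and_map(dist, query_ids, gallery_ids, ks=(1, 2)):
    """dist[i][j] is the distance between query i and gallery item j."""
    hits = {k: 0 for k in ks}
    ap_sum = 0.0
    for i, qid in enumerate(query_ids):
        # Rank gallery items from closest to farthest.
        order = sorted(range(len(gallery_ids)), key=lambda j: dist[i][j])
        matches = [gallery_ids[j] == qid for j in order]
        for k in ks:
            # Rank-k hit: a correct match appears within the top k results.
            hits[k] += any(matches[:k])
        # Average precision: mean of precision at each correct match.
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        ap_sum += sum(precisions) / len(precisions) if precisions else 0.0
    n = len(query_ids)
    return {k: hits[k] / n for k in ks}, ap_sum / n

# Hypothetical toy data: two queries against a four-item gallery.
gallery_ids = ["a", "b", "a", "c"]
query_ids = ["a", "b"]
dist = [
    [0.2, 0.1, 0.4, 0.9],
    [0.5, 0.1, 0.2, 0.8],
]
cmc, mean_ap = rank_k_and_map(dist, query_ids, gallery_ids)
```

Here `cmc` maps each k to the fraction of queries with a correct match in the top k, and `mean_ap` averages the per-query average precision.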
Circularity Check
No significant circularity detected
full rationale
The paper's central contribution is the release of the MVB dataset with stated scale (4519 identities, 22660 images), capture method, and material labels, plus a baseline merged Siamese network for evaluation. No derivation chain, equations, fitted parameters presented as predictions, or self-citations are invoked to support load-bearing claims. The dataset introduction stands as an independent empirical contribution without reduction to its own inputs or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard assumptions in computer vision for image annotation, multi-view capture, and ReID evaluation hold for baggage images.