Data Selection for training Semantic Segmentation CNNs with cross-dataset weak supervision
Pith reviewed 2026-05-24 20:53 UTC · model grok-4.3
The pith
Selecting subsets of weakly labeled images lets semantic segmentation networks match full-set accuracy with up to 100 times less data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modeling image representations with a Gaussian Mixture Model finds visually similar images, while counting object instances from bounding boxes finds diverse images. Both criteria select small subsets of weakly labeled data that train semantic segmentation CNNs to the same accuracy as the full sets, with reductions of up to 100 times on Open Images and up to 20 times on Cityscapes.
What carries the argument
A Gaussian Mixture Model fitted to image feature representations drives the similarity-based selection, and object counts taken from bounding boxes measure diversity. Both act as filters that shrink the weak training set before the segmentation network is trained.
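The similarity filter can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the GMM is fitted to features of the target-domain images and that candidates are ranked by their log-likelihood under that model. The function name `select_similar`, the component count, the diagonal covariance, and the synthetic features are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_similar(target_feats, candidate_feats, k, n_components=4, seed=0):
    """Return indices of the k candidates most likely under a GMM fitted to target_feats."""
    gmm = GaussianMixture(
        n_components=n_components, covariance_type="diag", random_state=seed
    )
    gmm.fit(target_feats)
    # Per-sample log-likelihood of each candidate under the fitted mixture.
    log_lik = gmm.score_samples(candidate_feats)
    # Highest likelihood first.
    return np.argsort(log_lik)[::-1][:k]


# Synthetic stand-ins for image embeddings (assumption: 8-dimensional features).
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(500, 8))   # target-domain features
near = rng.normal(0.0, 1.0, size=(50, 8))      # candidates resembling the target
far = rng.normal(6.0, 1.0, size=(50, 8))       # candidates from a different region
candidates = np.vstack([near, far])

selected = select_similar(target, candidates, k=20)
```

In this toy setup the selected indices all fall in the first 50 rows, the candidates drawn from the same distribution as the target. No labels are used at any point, which matches the label-free property claimed for the GMM method.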
If this is right
- The GMM method requires no labels at all, only raw image features.
- The diversity method needs only the bounding-box annotations already present.
- Accuracy stays level even after cutting the weak data volume by the reported factors on both datasets.
- GMM fitting also yields direct descriptions of the underlying image distribution.
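The bounding-box diversity criterion in the second point above could look like the sketch below. The scoring rule is an assumption: the source only says that object instances are counted from bounding boxes, so ranking first by number of distinct classes and then by number of instances is a hypothetical choice, as are the names `diversity_score` and `select_diverse` and the toy annotations.

```python
from collections import Counter


def diversity_score(boxes):
    """boxes: list of (class_name, x_min, y_min, x_max, y_max).

    Returns (number of distinct classes, total number of instances).
    """
    counts = Counter(label for label, *_ in boxes)
    return (len(counts), sum(counts.values()))


def select_diverse(annotations, k):
    """Return the k image names with the highest diversity score."""
    ranked = sorted(
        annotations, key=lambda name: diversity_score(annotations[name]), reverse=True
    )
    return ranked[:k]


# Toy bounding-box annotations (hypothetical file names and boxes).
annotations = {
    "img_a.jpg": [("car", 0, 0, 10, 10)],
    "img_b.jpg": [
        ("car", 0, 0, 10, 10),
        ("person", 5, 5, 8, 9),
        ("bicycle", 1, 1, 4, 4),
    ],
    "img_c.jpg": [
        ("car", 0, 0, 5, 5),
        ("car", 6, 0, 9, 5),
        ("car", 0, 6, 5, 9),
        ("car", 6, 6, 9, 9),
    ],
}

top = select_diverse(annotations, k=2)
```

Only the class names of the boxes are needed here; the coordinates are carried along but unused, so the criterion works with whatever bounding-box labels are already present.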
Where Pith is reading between the lines
- The two selection rules could be applied together to form even smaller yet still sufficient subsets.
- The same filtering logic might transfer to other tasks that rely on bounding-box weak labels, such as object detection.
- Lower data volume would also cut the compute time and memory needed for each training run.
Load-bearing premise
The chosen small subsets still hold enough variety for the network to learn the same pixel-level class distinctions that the full weak collection would provide.
What would settle it
Train identical segmentation networks on the selected reduced sets versus the full weak sets and check whether mean intersection-over-union on a fixed test set drops below the full-set result.
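The comparison metric can be computed as in the sketch below, which uses the standard definition of mean intersection-over-union from a confusion matrix. The tiny label maps and the variable names `pred_full` and `pred_subset` are hypothetical stand-ins for predictions of networks trained on the full and the reduced weak sets.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    idx = num_classes * y_true.ravel() + y_pred.ravel()
    return np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)


def mean_iou(conf):
    """Mean IoU over the classes that occur in the labels or the predictions."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = union > 0
    return float((tp[valid] / union[valid]).mean())


# Toy 2x3 label maps with three classes.
gt = np.array([[0, 0, 1], [1, 2, 2]])
pred_full = np.array([[0, 0, 1], [1, 2, 2]])    # stand-in: model trained on full set
pred_subset = np.array([[0, 1, 1], [1, 2, 2]])  # stand-in: model trained on subset

miou_full = mean_iou(confusion_matrix(gt, pred_full, 3))
miou_subset = mean_iou(confusion_matrix(gt, pred_subset, 3))
gap = miou_full - miou_subset
```

In practice the confusion matrix would be accumulated over the whole fixed test set before calling `mean_iou`, and the gap would be judged against run-to-run variation.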
Original abstract
Training convolutional networks for semantic segmentation with strong (per-pixel) and weak (per-bounding-box) supervision requires a large amount of weakly labeled data. We propose two methods for selecting the most relevant data with weak supervision. The first method is designed for finding visually similar images without the need of labels and is based on modeling image representations with a Gaussian Mixture Model (GMM). As a byproduct of GMM modeling, we present useful insights on characterizing the data generating distribution. The second method aims at finding images with high object diversity and requires only the bounding box labels. Both methods are developed in the context of automated driving and experimentation is conducted on Cityscapes and Open Images datasets. We demonstrate performance gains by reducing the amount of employed weakly labeled images up to 100 times for Open Images and up to 20 times for Cityscapes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two methods for selecting subsets of weakly labeled (bounding-box) images to train semantic segmentation CNNs: (1) GMM modeling of global image representations to identify visually similar images without using labels, and (2) a bounding-box-based selection for images with high object diversity. Experiments are performed in the automated-driving setting on Cityscapes and Open Images; the central claim is that these selections yield performance gains while reducing the weakly labeled training data by up to 20× (Cityscapes) and 100× (Open Images).
Significance. If the experimental results demonstrate that the reduced subsets maintain segmentation accuracy comparable to the full weak-supervision set, the work would be significant for reducing annotation and compute costs in large-scale semantic segmentation. The GMM byproduct insights on characterizing the data-generating distribution could also be useful for dataset analysis.
major comments (1)
- [Abstract and Methods] The claim that GMM-selected subsets (and diversity-selected ones) allow a segmentation CNN to reach performance comparable to the full weak set is load-bearing, yet the method operates solely on global image embeddings. Nothing in the selection guarantees preservation of semantic class frequencies or spatial contexts; in automated-driving data, global features often correlate with scene style or illumination rather than object-class presence. If rare classes (e.g., traffic signs, cyclists) are under-represented, reported gains cannot be attributed to the selection preserving information content.
minor comments (1)
- [Abstract] The abstract states performance gains but supplies no quantitative numbers, baselines, error bars, or dataset splits, making it impossible to verify whether the claimed reductions actually preserve accuracy.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to incorporate additional analysis as outlined.
Referee: [Abstract and Methods] The claim that GMM-selected subsets (and diversity-selected ones) allow a segmentation CNN to reach performance comparable to the full weak set is load-bearing, yet the method operates solely on global image embeddings. Nothing in the selection guarantees preservation of semantic class frequencies or spatial contexts; in automated-driving data, global features often correlate with scene style or illumination rather than object-class presence. If rare classes (e.g., traffic signs, cyclists) are under-represented, reported gains cannot be attributed to the selection preserving information content.
Authors: We agree that the GMM-based selection using global image embeddings provides no explicit guarantee of preserving semantic class frequencies or spatial contexts, and that global features in driving scenes may correlate more with style or illumination than with object presence. This is a substantive methodological limitation. Our defense rests on the empirical results: the selected subsets achieve segmentation performance comparable to the full weak-supervision set despite the large reductions (20× on Cityscapes, 100× on Open Images). These outcomes indicate that the visual similarity modeled by the GMM selects sufficiently informative images in practice for this task and these datasets. To directly address the concern, we will add an analysis of per-class frequencies (including rare classes such as traffic signs and cyclists) in the GMM-selected and diversity-selected subsets versus the full sets, to be included in the revised manuscript.
Revision: yes
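A per-class frequency check of the kind promised in the rebuttal might be sketched as follows. This is an illustration only: the helper `class_frequencies`, the toy label lists, and the choice to measure relative instance frequency are assumptions, not details taken from the manuscript.

```python
from collections import Counter


def class_frequencies(image_labels):
    """image_labels: iterable of per-image lists of class names.

    Returns the relative instance frequency of each class.
    """
    counts = Counter(label for labels in image_labels for label in labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Toy per-image class lists derived from bounding-box labels (hypothetical).
full_set = [
    ["car", "car", "person"],
    ["car", "traffic sign"],
    ["car", "cyclist"],
    ["car", "person"],
]
subset = [full_set[0], full_set[3]]  # stand-in for a selected subset

full_freq = class_frequencies(full_set)
sub_freq = class_frequencies(subset)

# Classes present in the full set but absent from the selected subset.
missing = sorted(set(full_freq) - set(sub_freq))
```

Here the subset drops both rare classes entirely, which is exactly the failure mode the referee describes; on real data the same comparison would show whether a selection rule thins out traffic signs or cyclists.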
Circularity Check
No circularity detected; empirical methods with no derivations
The paper describes two empirical data selection procedures (GMM modeling of image representations and bounding-box diversity counting) and reports experimental performance gains on Cityscapes and Open Images. No equations, derivations, predictions, or first-principles results are present in the provided text. Claims rest on standard statistical tools applied to external data rather than any self-definitional reduction, fitted-input renaming, or load-bearing self-citation chain. The work is therefore self-contained against external benchmarks with no circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Image representations modeled by a GMM capture visual similarity relevant to semantic segmentation performance
- domain assumption: Higher object diversity (measured by bounding boxes) improves training data quality for segmentation